On the Right Track! Analysing and Predicting Navigation Success in Wikipedia. Koopmann, Tobias; Dallmann, Alexander; Hettinger, Lena; Niebler, Thomas; Hotho, Andreas in HT '19, C. Atzenbeck, Rubart, J., Millard, D. E. (eds.) (2019). 143–152.
Flow-based network traffic generation using Generative Adversarial Networks
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
EClaiRE: Context Matters! – Comparing Word Embeddings for Relation Classification. Hettinger, Lena; Zehe, Albin; Dallmann, Alexander; Hotho, Andreas in K. David, Geihs, K., Lange, M., Stumme, G. (eds.) (2019). 191–204.
Detection of slow port scans in flow-based network traffic
Frequently, port scans are early indicators of more serious attacks. Unfortunately, the detection of slow port scans in company networks is challenging due to the massive amount of network data. This paper proposes an innovative approach for preprocessing flow-based data which is specifically tailored to the detection of slow port scans. The preprocessing chain generates new objects based on flow-based data aggregated over time windows while taking domain knowledge as well as additional knowledge about the network structure into account. The computed objects are used as input for further analysis. Based on these objects, we propose two different approaches for the detection of slow port scans. One approach is unsupervised and uses sequential hypothesis testing, whereas the other approach is supervised and uses classification algorithms. We compare both approaches with existing port scan detection algorithms on the flow-based CIDDS-001 data set. Experiments indicate that the proposed approaches achieve better detection rates and exhibit fewer false alarms than similar algorithms.
ClaiRE at SemEval-2018 Task 7: Classification of Relations using Embeddings. Hettinger, Lena; Dallmann, Alexander; Zehe, Albin; Niebler, Thomas; Hotho, Andreas (2018).
Accessing Information with Tags: Search and Ranking. Navarro Bullock, Beate; Hotho, Andreas; Stumme, Gerd in Social Information Access: Systems and Technologies, P. Brusilovsky, He, D. (eds.) (2018). 310–343.
Accessing Information with Tags: Search and Ranking
With the growth of the Social Web, a variety of new web-based services arose and changed the way users interact with the internet and consume information. One central phenomenon was and is tagging, which allows users to manage, organize and access information in social systems. Tagging helps to manage all kinds of resources, making their access much easier. The first type of social tagging systems were social bookmarking systems, i.e., platforms for storing and sharing bookmarks on the web rather than just in the browser. Meanwhile, (hash-)tagging is central in many other Social Media systems such as social networking sites and micro-blogging platforms. To allow for efficient information access, special algorithms have been developed to guide the user, to search for information and to rank the content based on tagging information contributed by the users.
Flow-based Network Traffic Generation using Generative Adversarial Networks. Ring, Markus; Schlör, Daniel; Landes, Dieter; Hotho, Andreas in CoRR (2018). abs/1810.07795
Flow-based Network Traffic Generation using Generative Adversarial Networks
Flow-based data sets are necessary for evaluating network-based intrusion detection systems (NIDS). In this work, we propose a novel methodology for generating realistic flow-based network traffic. Our approach is based on Generative Adversarial Networks (GANs) which achieve good results for image generation. A major challenge lies in the fact that GANs can only process continuous attributes. However, flow-based data inevitably contain categorical attributes such as IP addresses or port numbers. Therefore, we propose three different preprocessing approaches for flow-based data in order to transform them into continuous values. Further, we present a new method for evaluating the generated flow-based network traffic which uses domain knowledge to define quality tests. We use the three approaches for generating flow-based network traffic based on the CIDDS-001 data set. Experiments indicate that two of the three approaches are able to generate high quality data.
Big-Data Helps SDN to Improve Application Specific Quality of Service. Schwarzmann, Susanna; Blenk, Andreas; Dobrijevic, Ognjen; Jarschel, Michael; Hotho, Andreas; Zinner, Thomas; Wamser, Florian in Big Data and Software Defined Networks (2018).
Big-Data Helps SDN to Improve Application Specific Quality of Service
This chapter first provides an outline of the current results in the domains of: (a) quality-of-service (QoS) / quality-of-experience (QoE) control and management (CaM) for real-time multimedia services that is supported by software-defined networking (SDN), and (b) big data analytics and methods that are used for QoS/QoE CaM. Then, three specific use case scenarios with respect to video streaming services are presented, so as to illustrate the expected benefits of incorporating big data analytics into SDN-based CaM for the purposes of improving or optimizing QoS/QoE. In the end, we describe our vision and a high-level view of an SDN-based architecture for QoS/QoE CaM that is enriched with big data analytics' functional blocks and summarize corresponding challenges.
EveryAware Gears: A Tool to visualize and analyze all types of Citizen Science Data. Lautenschlager, Florian; Becker, Martin; Steininger, Michael; Hotho, Andreas in D. Burghardt, Chen, S., Andrienko, G., Andrienko, N., Purves, R., Diehl, A. (eds.) (2018).
A White-Box Model for Detecting Author Nationality by Linguistic Differences in Spanish Novels. Zehe, Albin; Schlör, Daniel; Henny-Krahmer, Ulrike; Becker, Martin; Hotho, Andreas (2018).
Learning Semantic Relatedness from Human Feedback Using Relative Relatedness Learning. Niebler, Thomas; Becker, Martin; Pölitz, Christian; Hotho, Andreas (2017).
A Bayesian Method for Comparing Hypotheses About Human Trails. Singer, Philipp; Helic, Denis; Hotho, Andreas; Strohmaier, Markus in ACM Trans. Web (2017). 11(3) 14:1–14:29.
A Bayesian Method for Comparing Hypotheses About Human Trails
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful, for example, for improving underlying network structures, predicting user clicks, or enhancing recommendations. In this work, we present a method called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our method utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to calculate the evidence of the data under them. For eliciting Dirichlet priors from hypotheses, we present an adaption of the so-called (trial) roulette method, and to compare the relative plausibility of hypotheses, we employ Bayes factors. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including Web site navigation, business reviews, and online music played. Our work expands the repertoire of methods available for studying human trails.
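The evidence computation named in the abstract above can be illustrated with a minimal Python sketch. It assumes a first-order Markov chain whose rows each carry an independent Dirichlet prior, so the marginal likelihood has a closed form. The function names, the toy transition counts, and the simplified prior elicitation (one uniform pseudo-count plus k pseudo-counts spread according to the hypothesis) are assumptions made for illustration; they are not the authors' implementation and do not reproduce the (trial) roulette method.

```python
from math import exp, lgamma


def log_evidence(counts, alpha):
    """Log marginal likelihood of transition counts under row-wise Dirichlet priors.

    counts[i][j] is the number of observed transitions from state i to state j,
    alpha[i][j] is the matching Dirichlet pseudo-count (must be > 0).
    """
    total = 0.0
    for n_row, a_row in zip(counts, alpha):
        total += lgamma(sum(a_row)) - lgamma(sum(n_row) + sum(a_row))
        total += sum(lgamma(n + a) - lgamma(a) for n, a in zip(n_row, a_row))
    return total


def elicit_prior(hypothesis, k):
    """Simplified elicitation (assumption): 1 + k * belief per cell.

    hypothesis[i][j] is the believed probability of moving from i to j;
    larger k expresses stronger belief in the hypothesis.
    """
    return [[1.0 + k * p for p in row] for row in hypothesis]


# Toy data: two states, most transitions switch 0 -> 1 and 1 -> 0.
counts = [[2, 18], [15, 5]]

h_switch = [[0.1, 0.9], [0.8, 0.2]]  # hypothesis: users tend to switch state
h_stay = [[0.9, 0.1], [0.2, 0.8]]    # hypothesis: users tend to stay

k = 10
log_bf = (log_evidence(counts, elicit_prior(h_switch, k))
          - log_evidence(counts, elicit_prior(h_stay, k)))

# A positive log Bayes factor favours h_switch over h_stay.
print(f"log Bayes factor: {log_bf:.2f}, Bayes factor: {exp(log_bf):.3g}")
```

Working in log space with lgamma avoids overflow for realistic trail counts; repeating the comparison for several values of k shows how the ranking of hypotheses behaves as belief strength grows.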
MixedTrails: Bayesian hypothesis comparison on heterogeneous sequential data. Becker, Martin; Lemmerich, Florian; Singer, Philipp; Strohmaier, Markus; Hotho, Andreas in Data Mining and Knowledge Discovery (2017).
MixedTrails: Bayesian hypothesis comparison on heterogeneous sequential data
Sequential traces of user data are frequently observed online and offline, e.g., as sequences of visited websites or as sequences of locations captured by GPS. However, understanding factors explaining the production of sequence data is a challenging task, especially since the data generation is often not homogeneous. For example, navigation behavior might change in different phases of browsing a website or movement behavior may vary between groups of users. In this work, we tackle this task and propose MixedTrails, a Bayesian approach for comparing the plausibility of hypotheses regarding the generative processes of heterogeneous sequence data. Each hypothesis is derived from existing literature, theory, or intuition and represents a belief about transition probabilities between a set of states that can vary between groups of observed transitions. For example, when trying to understand human movement in a city and given some data, a hypothesis assuming tourists to be more likely to move towards points of interest than locals can be shown to be more plausible than a hypothesis assuming the opposite. Our approach incorporates such hypotheses as Bayesian priors in a generative mixed transition Markov chain model, and compares their plausibility utilizing Bayes factors. We discuss analytical and approximate inference methods for calculating the marginal likelihoods for Bayes factors, give guidance on interpreting the results, and illustrate our approach with several experiments on synthetic and empirical data from Wikipedia and Flickr. Thus, this work enables a novel kind of analysis for studying sequential data in many application areas.
Leveraging User-Interactions for Time-Aware Tag Recommendations. Zoller, Daniel; Doerfel, Stephan; Pölitz, Christian; Hotho, Andreas in CEUR Workshop Proceedings (2017).
A Toolset for Intrusion and Insider Threat Detection. Ring, Markus; Wunderlich, Sarah; Grüdl, Dominik; Landes, Dieter; Hotho, Andreas in Data Analytics and Decision Support for Cybersecurity: Trends, Methodologies and Applications, I. Palomares Carrascosa, Kalutarage, H. K., Huang, Y. (eds.) (2017). 3–31.
A Toolset for Intrusion and Insider Threat Detection
Company data are a valuable asset and must be protected against unauthorized access and manipulation. In this contribution, we report on our ongoing work that aims to support IT security experts with identifying novel or obfuscated attacks in company networks, irrespective of their origin inside or outside the company network. A new toolset for anomaly based network intrusion detection is proposed. This toolset uses flow-based data which can be easily retrieved by central network components. We study the challenges of analysing flow-based data streams using data mining algorithms and build an appropriate approach step by step. In contrast to previous work, we collect flow-based data for each host over a certain time window, include the knowledge of domain experts and analyse the data from three different views. We argue that incorporating expert knowledge and previous flows allows us to create more meaningful attributes for subsequent analysis methods. This way, we try to detect novel attacks while simultaneously limiting the number of false positives.
Editors: Palomares Carrascosa, Iván; Kalutarage, Harsha Kumara; Huang, Yan
Learning Word Embeddings from Tagging Data: A methodological comparison. Niebler, Thomas; Hahn, Luzian; Hotho, Andreas (2017).
Learning Semantic Relatedness From Human Feedback Using Metric Learning
Assessing the degree of semantic relatedness between words is an important task with a variety of semantic applications, such as ontology learning for the Semantic Web, semantic search or query expansion. To accomplish this in an automated fashion, many relatedness measures have been proposed. However, most of these metrics only encode information contained in the underlying corpus and thus do not directly model human intuition. To solve this, we propose to utilize a metric learning approach to improve existing semantic relatedness measures by learning from additional information, such as explicit human feedback. For this, we argue to use word embeddings instead of traditional high-dimensional vector representations in order to leverage their semantic density and to reduce computational cost. We rigorously test our approach on several domains including tagging data as well as publicly available embeddings based on Wikipedia texts and navigation. Human feedback about semantic relatedness for learning and evaluation is extracted from publicly available datasets such as MEN or WS-353. We find that our method can significantly improve semantic relatedness measures by learning from additional information, such as explicit human feedback. For tagging data, we are the first to generate and study embeddings. Our results are of special interest for ontology and recommendation engineers, but also for any other researchers and practitioners of Semantic Web techniques.
Sedentary Behavior among National Elite Rowers during Off-Training—A Pilot Study. Sperlich, Billy; Becker, Martin; Hotho, Andreas; Wallmann-Sperlich, Birgit; Sareban, Mahdi; Winkert, Kay; Steinacker, Jürgen M.; Treff, Gunnar in Frontiers in Physiology (2017). 8 655.
Sedentary Behavior among National Elite Rowers during Off-Training—A Pilot Study
The aim of this pilot study was to analyze the off-training physical activity (PA) profile in national elite German U23 rowers during 31 days of their preparation period. The hours spent in each PA category (i.e. sedentary: <1.5 MET; light physical activity: 1.5–3 MET; moderate physical activity: 3–6 MET and vigorous intense physical activity: >6 MET) were calculated for every valid day (i.e. > 480 min of wear time). The off-training PA during 21 weekdays and 10 weekend days of the final 11-wk preparation period was assessed by a wrist-worn multisensory device (Microsoft Band II (MSBII)). A total of 11 rowers provided valid data (i.e. > 480 min/day) for 11.6 weekdays and 4.8 weekend days during the 31-day observation period. The average sedentary time was 11.63±1.25 hours per day during the week and 12.49±1.10 hours per day on the weekend, with a tendency to be higher on the weekend compared to weekdays (p = 0.06; d = 0.73). The average time in light, moderate and vigorous PA during the weekdays was 1.27±1.15, 0.76±0.37, 0.51±0.44 hours per day and 0.67±0.43, 0.59±0.37, 0.53±0.32 hours per weekend day. Light physical activity was higher during weekdays compared to the weekend (p = 0.04; d = 0.69). Based on our pilot study of eleven national elite rowers we conclude that rowers display a considerable sedentary off-training behavior of more than 11.5 hours/day.
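The categorisation rule stated in the abstract above (MET thresholds plus the wear-time criterion for a valid day) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be one MET value per worn minute, the function names are hypothetical, and the handling of the exact boundary values 3 and 6 MET is a choice made here because the abstract does not specify it.

```python
VALID_DAY_MINUTES = 480  # a day counts as valid only with more than 480 min of wear time


def pa_category(met):
    """Map a MET value to a physical activity category.

    Boundary handling is an assumption: exactly 3 MET counts as moderate,
    exactly 6 MET counts as moderate (vigorous is strictly above 6).
    """
    if met < 1.5:
        return "sedentary"
    if met < 3.0:
        return "light"
    if met <= 6.0:
        return "moderate"
    return "vigorous"


def hours_per_category(minute_mets):
    """Return hours spent per category for one day, or None if the day is invalid.

    minute_mets holds one MET value for every minute the device was worn
    (hypothetical input format).
    """
    if len(minute_mets) <= VALID_DAY_MINUTES:
        return None
    minutes = {"sedentary": 0, "light": 0, "moderate": 0, "vigorous": 0}
    for met in minute_mets:
        minutes[pa_category(met)] += 1
    return {category: count / 60 for category, count in minutes.items()}


# Toy day: 600 sedentary minutes, 60 light minutes, 30 vigorous minutes.
day = [1.0] * 600 + [2.0] * 60 + [7.5] * 30
print(hours_per_category(day))
```

Counting whole minutes first and dividing once at the end keeps the hour totals free of accumulated floating-point error.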
IP2Vec: Learning Similarities Between IP Addresses. Ring, Markus; Landes, Dieter; Dallmann, Alexander; Hotho, Andreas in 2017 IEEE International Conference on Data Mining Workshops (ICDMW) (2017). 657–666.
Comparing Hypotheses About Sequential Data: A Bayesian Approach and Its Applications. Lemmerich, Florian; Singer, Philipp; Becker, Martin; Espin-Noboa, Lisette; Dimitrov, Dimitar; Helic, Denis; Hotho, Andreas; Strohmaier, Markus in Y. Altun, Das, K., Mielikäinen, T., Malerba, D., Stefanowski, J., Read, J., Žitnik, M., Ceci, M., Džeroski, S. (eds.) (2017). 354–357.
Comparing Hypotheses About Sequential Data: A Bayesian Approach and Its Applications
Sequential data can be found in many settings, e.g., as sequences of visited websites or as location sequences of travellers. To improve the understanding of the underlying mechanisms that generate such sequences, the HypTrails approach provides for a novel data analysis method. Based on first-order Markov chain models and Bayesian hypothesis testing, it allows for comparing a set of hypotheses, i.e., beliefs about transitions between states, with respect to their plausibility considering observed data. HypTrails has been successfully employed to study phenomena in the online and the offline world. In this talk, we want to give an introduction to HypTrails and showcase selected real-world applications on urban mobility and reading behavior on Wikipedia.
Editors: Altun, Yasemin; Das, Kamalika; Mielikäinen, Taneli; Malerba, Donato; Stefanowski, Jerzy; Read, Jesse; Žitnik, Marinka; Ceci, Michelangelo; Džeroski, Sašo
Flow-based benchmark data sets for intrusion detection. Ring, Markus; Wunderlich, Sarah; Grüdl, Dominik; Landes, Dieter; Hotho, Andreas (2017). 361–369.
Creation of Flow-Based Data Sets for Intrusion Detection. Ring, Markus; Wunderlich, Sarah; Grüdl, Dominik; Landes, Dieter; Hotho, Andreas in Journal of Information Warfare (2017). 16(4) 41–54.
Creation of Flow-Based Data Sets for Intrusion Detection
Publicly available labelled data sets are necessary for evaluating anomaly-based Intrusion Detection Systems (IDS). However, existing data sets are often not up-to-date or not yet published because of privacy concerns. This paper identifies requirements for good data sets and proposes an approach for their generation. The key idea is to use a test environment and emulate realistic user behaviour with parameterised scripts on the clients. Comprehensive logging mechanisms provide additional information which may be used for a better understanding of the inner dynamics of an IDS. Finally, the proposed approach is used to generate the flow-based CIDDS-002 data set.
Towards Sentiment Analysis on German Literature. Zehe, Albin; Becker, Martin; Jannidis, Fotis; Hotho, Andreas in G. Kern-Isberner, Fürnkranz, J., Thimm, M. (eds.) (2017). 387–394.
Sentiment Analysis is a Natural Language Processing-task that is relevant in a number of contexts, including the analysis of literature. We report on ongoing research towards enabling, for the first time, sentence-level Sentiment Analysis in the domain of German novels. We create a labelled dataset from sentences extracted from German novels and, by adapting existing sentiment classifiers, reach promising F1-scores of 0.67 for binary polarity classification.
Editors: Kern-Isberner, Gabriele; Fürnkranz, Johannes; Thimm, Matthias
Improving Session Recommendation with Recurrent Neural Networks by Exploiting Dwell Time.Dallmann, Alexander; Grimm, Alexander; Pölitz, Christian; Zoller, Daniel; Hotho, Andreas (2017).
Recently, Recurrent Neural Networks (RNNs) have been applied to the task of session-based recommendation. These approaches use RNNs to predict the next item in a user session based on the previously visited items. While some approaches consider additional item properties, we argue that item dwell time can be used as an implicit measure of user interest to improve session-based item recommendations. We propose an extension to existing RNN approaches that captures user dwell time in addition to the visited items and show that recommendation performance can be improved. Additionally, we investigate the usefulness of a single validation split for model selection in the case of minor improvements and find that in our case the best model is not selected and a fold-like study with different validation sets is necessary to ensure the selection of the best model.
Prediction of Happy Endings in German Novels.Zehe, Albin; Becker, Martin; Hettinger, Lena; Hotho, Andreas; Reger, Isabella; Jannidis, Fotis in CEUR Workshop Proceedings, P. Cellier, Charnois, T., Hotho, A., Matwin, S., Moens, M. -F., Toussaint, Y. (eds.) (2016). (Vol. 1646) 9-16.
Comparison of non-invasive individual monitoring of the training and health of athletes with commercially available wearable technologies.Düking, Peter; Hotho, Andreas; Fuss, Franz Konstantin; Holmberg, Hans-Christer; Sperlich, Billy in Frontiers in Physiology (2016). 7(71)
Athletes adapt their training daily to optimize performance, as well as avoid fatigue, overtraining and other undesirable effects on their health. To optimize training load, each athlete must take his/her own personal objective and subjective characteristics into consideration and an increasing number of wearable technologies (wearables) provide convenient monitoring of various parameters. Accordingly, it is important to help athletes decide which parameters are of primary interest and which wearables can monitor these parameters most effectively. Here, we discuss the wearable technologies available for non-invasive monitoring of various parameters concerning an athlete's training and health. On the basis of these considerations, we suggest directions for future development. Furthermore, we propose that a combination of several wearables is most effective for accessing all relevant parameters, disturbing the athlete as little as possible, and optimizing performance and promoting health.
Posted, Visited, Exported: Altmetrics in the Social Tagging System BibSonomy.Zoller, Daniel; Doerfel, Stephan; Jäschke, Robert; Stumme, Gerd; Hotho, Andreas in Journal of Informetrics (2016). 10(3) 732 - 749.
In social tagging systems, like Mendeley, CiteULike, and BibSonomy, users can post, tag, visit, or export scholarly publications. In this paper, we compare citations with metrics derived from users’ activities (altmetrics) in the popular social bookmarking system BibSonomy. Our analysis, using a corpus of more than 250,000 publications published before 2010, reveals that overall, citations and altmetrics in BibSonomy are mildly correlated. Furthermore, grouping publications by user-generated tags results in topic-homogeneous subsets that exhibit higher correlations with citations than the full corpus. We find that posts, exports, and visits of publications are correlated with citations and even bear predictive power over future impact. Machine learning classifiers predict whether the number of citations that a publication receives in a year exceeds the median number of citations in that year, based on the usage counts of the preceding year. In that setup, a Random Forest predictor outperforms the baseline on average by seven percentage points.
What Users Actually do in a Social Tagging System: A Study of User Behavior in BibSonomy.Doerfel, Stephan; Zoller, Daniel; Singer, Philipp; Niebler, Thomas; Hotho, Andreas; Strohmaier, Markus in ACM Transactions on the Web (2016). 10(2) 14:1--14:32.
Social tagging systems have established themselves as an important part in today’s web and have attracted the interest of our research community in a variety of investigations. Henceforth, several aspects of social tagging systems have been discussed and assumptions have emerged on which our community builds their work. Yet, testing such assumptions has been difficult due to the absence of suitable usage data in the past. In this work, we thoroughly investigate and evaluate four aspects about tagging systems, covering social interaction, retrieval of posted resources, the importance of the three different types of entities, users, resources, and tags, as well as connections between these entities’ popularity in posted and in requested content. For that purpose, we examine live server log data gathered from the real-world, public social tagging system BibSonomy. Our empirical results paint a mixed picture about the four aspects. While for some, typical assumptions hold to a certain extent, other aspects need to be reflected in a very critical light. Our observations have implications for the understanding of social tagging systems, and the way they are used on the web. We make the dataset used in this work available to other researchers.
Significance Testing for the Classification of Literary Subgenres.Hettinger, Lena; Jannidis, Fotis; Reger, Isabella; Hotho, Andreas (2016).
Proceedings of the Workshop on Interactions between Data Mining and Natural Language Processing, DMNLP 2016, co-located with the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, ECML-PKDD 2016, Riva del Garda, Italy, September 23, 2016.Cellier, Peggy; Charnois, Thierry; Hotho, Andreas; Matwin, Stan; Moens, Marie-Francine; Toussaint, Yannick in CEUR Workshop Proceedings (2016). (Vol. 1646) CEUR-WS.org.
Further information
Editors
Cellier, Peggy and Charnois, Thierry and Hotho, Andreas and Matwin, Stan and Moens, Marie-Francine and Toussaint, Yannick
Extracting Semantics from Random Walks on Wikipedia: Comparing learning and counting methods.Dallmann, Alexander; Niebler, Thomas; Lemmerich, Florian; Hotho, Andreas in R. West, Zia, L., Taraborelli, D., Leskovec, J. (eds.) (2016).
Semantic relatedness between words has been extracted from a variety of sources. In this ongoing work, we explore and compare several options for determining if semantic relatedness can be extracted from navigation structures in Wikipedia. In that direction, we first investigate the potential of representation learning techniques such as DeepWalk in comparison to previously applied methods based on counting co-occurrences. Since both methods are based on (random) paths in the network, we also study different approaches to generate paths from Wikipedia link structure. For this task, we do not only consider the link structure of Wikipedia, but also actual navigation behavior of users. Finally, we analyze if semantics can also be extracted from smaller subsets of the Wikipedia link network. As a result we find that representation learning techniques mostly outperform the investigated co-occurrence counting methods on the Wikipedia network. However, we find that this is not the case for paths sampled from human navigation behavior.
Further information
Editors
West, Robert and Zia, Leila and Taraborelli, Dario and Leskovec, Jure
Mining Subgroups with Exceptional Transition Behavior.Lemmerich, Florian; Becker, Martin; Singer, Philipp; Helic, Denis; Hotho, Andreas; Strohmaier, Markus in B. Krishnapuram, Shah, M., Smola, A. J., Aggarwal, C., Shen, D., Rastogi, R. (eds.) (2016). 965-974.
Extracting Semantics from Unconstrained Navigation on Wikipedia.Niebler, Thomas; Schlör, Daniel; Becker, Martin; Hotho, Andreas in KI (2016). 30(2) 163-168.
SparkTrails: A MapReduce Implementation of HypTrails for Comparing Hypotheses About Human Trails.Becker, Martin; Mewes, Hauke; Hotho, Andreas; Dimitrov, Dimitar; Lemmerich, Florian; Strohmaier, Markus in J. Bourdeau, Hendler, J., Nkambou, R., Horrocks, I., Zhao, B. Y. (eds.) (2016). 17-18.
Analyzing Features for the Detection of Happy Endings in German Novels.Jannidis, Fotis; Reger, Isabella; Zehe, Albin; Becker, Martin; Hettinger, Lena; Hotho, Andreas (2016).
With regard to a computational representation of literary plot, this paper looks at the use of sentiment analysis for happy ending detection in German novels. Its focus lies on the investigation of previously proposed sentiment features in order to gain insight about the relevance of specific features on the one hand and the implications of their performance on the other hand. Therefore, we study various partitionings of novels, considering the highly variable concept of "ending". We also show that our approach, even though still rather simple, can potentially lead to substantial findings relevant to literary studies.
FolkTrails: Interpreting Navigation Behavior in a Social Tagging System.Niebler, Thomas; Becker, Martin; Zoller, Daniel; Doerfel, Stephan; Hotho, Andreas in CIKM '16 (2016).
Social tagging systems have established themselves as a quick and easy way to organize information by annotating resources with tags. In recent work, user behavior in social tagging systems was studied, that is, how users assign tags, and consume content. However, it is still unclear how users make use of the navigation options they are given. Understanding their behavior and differences in behavior of different user groups is an important step towards assessing the effectiveness of a navigational concept and of improving it to better suit the users’ needs. In this work, we investigate navigation trails in the popular scholarly social tagging system BibSonomy from six years of log data. We discuss dynamic browsing behavior of the general user population and show that different navigational subgroups exhibit different navigational traits. Furthermore, we provide strong evidence that the semantic nature of the underlying folksonomy is an essential factor for explaining navigation.
Creation of Specific Flow-Based Training Data Sets for Usage Behaviour Classification.Otto, Florian; Ring, Markus; Landes, Dieter; Hotho, Andreas (2016). 437.
On Publication Usage in a Social Bookmarking System
Scholarly success is traditionally measured in terms of citations to publications. With the advent of publication management and digital libraries on the web, scholarly usage data has become a target of investigation and new impact metrics computed on such usage data have been proposed – so-called altmetrics. In scholarly social bookmarking systems, scientists collect and manage publication meta data and thus reveal their interest in these publications. In this work, we investigate connections between usage metrics and citations, and find posts, exports, and page views of publications to be correlated with citations.
Modeling and Extracting Load Intensity Profiles.v. Kistowski, Jóakim; Herbst, Nikolas; Zoller, Daniel; Kounev, Samuel; Hotho, Andreas (2015).
Today’s system developers and operators face the challenge of creating software systems that make efficient use of dynamically allocated resources under highly variable and dynamic load profiles, while at the same time delivering reliable performance. Benchmarking of systems under these constraints is difficult, as state-of-the-art benchmarking frameworks provide only limited support for emulating such dynamic and highly variable load profiles for the creation of realistic workload scenarios. Industrial benchmarks typically confine themselves to workloads with constant or stepwise increasing loads. Alternatively, they support replaying of recorded load traces. Statistical load intensity descriptions also do not sufficiently capture concrete pattern load profile variations over time. To address these issues, we present the Descartes Load Intensity Model (DLIM). DLIM provides a modeling formalism for describing load intensity variations over time. A DLIM instance can be used as a compact representation of a recorded load intensity trace, providing a powerful tool for benchmarking and performance analysis. As manually obtaining DLIM instances can be time consuming, we present three different automated extraction methods, which also help to enable autonomous system analysis for self-adaptive systems. Model expressiveness is validated using the presented extraction methods. Extracted DLIM instances exhibit a median modeling error of 12.4% on average over nine different real-world traces covering between two weeks and seven months. Additionally, extraction methods perform orders of magnitude faster than existing time series decomposition approaches.
Participatory Patterns in an International Air Quality Monitoring Initiative.Sîrbu, Alina; Becker, Martin; Caminiti, Saverio; De Baets, Bernard; Elen, Bart; Francis, Louise; Gravino, Pietro; Hotho, Andreas; Ingarra, Stefano; Loreto, Vittorio; Molino, Andrea; Mueller, Juergen; Peters, Jan; Ricchiuti, Ferdinando; Saracino, Fabio; Servedio, Vito D. P.; Stumme, Gerd; Theunis, Jan; Tria, Francesca; Van den Bossche, Joris in PLoS ONE (2015). 10(8) e0136763.
The issue of sustainability is at the top of the political and societal agenda, being considered of extreme importance and urgency. Human individual action impacts the environment both locally (e.g., local air/water quality, noise disturbance) and globally (e.g., climate change, resource use). Urban environments represent a crucial example, with an increasing realization that the most effective way of producing a change is involving the citizens themselves in monitoring campaigns (a citizen science bottom-up approach). This is possible by developing novel technologies and IT infrastructures enabling large citizen participation. Here, in the wider framework of one of the first such projects, we show results from an international competition where citizens were involved in mobile air pollution monitoring using low cost sensing devices, combined with a web-based game to monitor perceived levels of pollution. Measures of shift in perceptions over the course of the campaign are provided, together with insights into participatory patterns emerging from this study. Interesting effects related to inertia and to direct involvement in measurement activities rather than indirect information exposure are also highlighted, indicating that direct involvement can enhance learning and environmental awareness. In the future, this could result in better adoption of policies towards decreasing pollution.
Participatory Patterns in an International Air Quality Monitoring Initiative.Sîrbu, Alina; Becker, Martin; Caminiti, Saverio; De Baets, Bernard; Elen, Bart; Francis, Louise; Gravino, Pietro; Hotho, Andreas; Ingarra, Stefano; Loreto, Vittorio; Molino, Andrea; Mueller, Juergen; Peters, Jan; Ricchiuti, Ferdinando; Saracino, Fabio; Servedio, Vito D. P.; Stumme, Gerd; Theunis, Jan; Tria, Francesca; Van den Bossche, Joris (2015).
MicroTrails: Comparing Hypotheses About Task Selection on a Crowdsourcing Platform.Becker, Martin; Borchert, Kathrin; Hirth, Matthias; Mewes, Hauke; Hotho, Andreas; Tran-Gia, Phuoc in i-KNOW '15 (2015). 10:1--10:8.
To optimize the workflow on commercial crowdsourcing platforms like Amazon Mechanical Turk or Microworkers, it is important to understand how users choose their tasks. Current work usually explores the underlying processes by employing user studies based on surveys with a limited set of participants. In contrast, we formulate hypotheses based on the different findings in these studies and, instead of verifying them based on user feedback, we compare them directly on data from a commercial crowdsourcing platform. For evaluation, we use a Bayesian approach called HypTrails which allows us to give a relative ranking of the corresponding hypotheses. The hypotheses considered are based, for example, on task categories, monetary incentives, or semantic similarity of task descriptions. We find that, in our scenario, hypotheses based on employers as well as the task descriptions work best. Overall, we objectively compare different factors influencing users when choosing their tasks. Our approach enables crowdsourcing companies to better understand their users in order to optimize their platforms, e.g., by incorporating the gained knowledge about these factors into task recommendation systems.
Automatic Threshold Calculation for the Categorical Distance Measure ConDist. Ring, Markus; Landes, Dieter; Hotho, Andreas in CEUR Workshop Proceedings, R. Bergmann, Görg, S., Müller, G. (eds.) (2015). (Vol. 1458) 52-63.
VizTrails: An Information Visualization Tool for Exploring Geographic Movement Trajectories. Becker, Martin; Singer, Philipp; Lemmerich, Florian; Hotho, Andreas; Helic, Denis; Strohmaier, Markus in HT '15 (2015). 319--320.
Understanding the way people move through urban areas represents an important problem that has implications for a range of societal challenges such as city planning, public transportation, or crime analysis. In this paper, we present an interactive visualization tool called VizTrails for exploring and understanding such human movement. It features visualizations that show aggregated statistics of trails for geographic areas that correspond to grid cells on a map, e.g., on the number of users passing through or on cells commonly visited next. Among other features, the system allows overlaying the map with the results of SPARQL queries in order to relate the observed trajectory statistics to their geo-spatial context, e.g., considering a city's points of interest. The system's functionality is demonstrated using trajectory examples extracted from the social photo sharing platform Flickr. Overall, VizTrails facilitates deeper insights into geo-spatial trajectory data by enabling interactive exploration of aggregated statistics and providing geo-spatial context.
Genre classification on German novels. Hettinger, Lena; Becker, Martin; Reger, Isabella; Jannidis, Fotis; Hotho, Andreas (2015).
Of course we share! Testing Assumptions about Social Tagging Systems. Doerfel, Stephan; Zoller, Daniel; Singer, Philipp; Niebler, Thomas; Hotho, Andreas; Strohmaier, Markus (2014).
Social tagging systems have established themselves as an important part of today's web and have attracted the interest of our research community in a variety of investigations. The overall vision of our community is that simply through interactions with the system, i.e., through tagging and sharing of resources, users would contribute to building useful semantic structures as well as resource indexes using uncontrolled vocabulary not only due to the easy-to-use mechanics. Hence, a variety of assumptions about social tagging systems have emerged, yet testing them has been difficult due to the absence of suitable data. In this work we thoroughly investigate three available assumptions - e.g., is a tagging system really social? - by examining live log data gathered from the real-world public social tagging system BibSonomy. Our empirical results indicate that while some of these assumptions hold to a certain extent, other assumptions need to be reflected upon and viewed in a very critical light. Our observations have implications for the design of future search and other algorithms to better reflect the actual user behavior.
Ubicon and its Applications for Ubiquitous Social Computing. Atzmueller, Martin; Becker, Martin; Kibanov, Mark; Scholz, Christoph; Doerfel, Stephan; Hotho, Andreas; Macek, Bjoern-Elmar; Mitzlaff, Folke; Mueller, Juergen; Stumme, Gerd in New Review of Hypermedia and Multimedia (2014). 20(1) 53--77.
The combination of ubiquitous and social computing is an emerging research area which integrates different but complementary methods, techniques and tools. In this paper, we focus on the Ubicon platform, its applications, and a large spectrum of analysis results. Ubicon provides an extensible framework for building and hosting applications targeting both ubiquitous and social environments. We summarize the architecture and exemplify its implementation using four real-world applications built on top of Ubicon. In addition, we discuss several scientific experiments in the context of these applications in order to give a better picture of the potential of the framework, and discuss analysis results using several real-world data sets collected utilizing Ubicon.
Proceedings of the 1st International Workshop on Interactions between Data Mining and Natural Language Processing co-located with The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, DMNLP@PKDD/ECML 2014, Nancy, France, September 15, 2014. Cellier, Peggy; Charnois, Thierry; Hotho, Andreas; Matwin, Stan; Moens, Marie-Francine; Toussaint, Yannick in CEUR Workshop Proceedings (2014). (Vol. 1202) CEUR-WS.org.
Evaluating Assumptions about Social Tagging - A Study of User Behavior in BibSonomy. Doerfel, Stephan; Zoller, Daniel; Singer, Philipp; Niebler, Thomas; Hotho, Andreas; Strohmaier, Markus T. Seidl, Hassani, M., Beecks, C. (eds.) (2014). 18--19.
How Social is Social Tagging? Doerfel, Stephan; Zoller, Daniel; Singer, Philipp; Niebler, Thomas; Hotho, Andreas; Strohmaier, Markus in WWW 2014 (2014).
The sixth ACM RecSys workshop on recommender systems and the social web. Jannach, Dietmar; Freyne, Jill; Geyer, Werner; Guy, Ido; Hotho, Andreas; Mobasher, Bamshad (2014). 395.
HypTrails: A Bayesian Approach for Comparing Hypotheses about Human Trails on the Web. Singer, Philipp; Helic, Denis; Hotho, Andreas; Strohmaier, Markus (2014).
When users interact with the Web today, they leave sequential digital trails on a massive scale. Examples of such human trails include Web navigation, sequences of online restaurant reviews, or online music play lists. Understanding the factors that drive the production of these trails can be useful, e.g., for improving underlying network structures, predicting user clicks or enhancing recommendations. In this work, we present a general approach called HypTrails for comparing a set of hypotheses about human trails on the Web, where hypotheses represent beliefs about transitions between states. Our approach utilizes Markov chain models with Bayesian inference. The main idea is to incorporate hypotheses as informative Dirichlet priors and to leverage the sensitivity of Bayes factors on the prior for comparing hypotheses with each other. For eliciting Dirichlet priors from hypotheses, we present an adaptation of the so-called (trial) roulette method. We demonstrate the general mechanics and applicability of HypTrails by performing experiments with (i) synthetic trails for which we control the mechanisms that have produced them and (ii) empirical trails stemming from different domains including website navigation, business reviews and online music played. Our work expands the repertoire of methods available for studying human trails on the Web.
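The core computation described in this abstract can be illustrated with a small sketch. It computes the log marginal likelihood of first-order Markov chain transition counts under row-wise Dirichlet priors and compares two hypotheses via a log Bayes factor. The function names, the toy data and the prior construction (`1 + k * belief`) are assumptions made for illustration; in particular, the prior construction is a simplified stand-in and not the trial roulette elicitation from the paper.

```python
from math import lgamma


def log_evidence(counts, alpha):
    """Log marginal likelihood of transition counts under Dirichlet priors.

    counts[i][j]: observed transitions from state i to state j.
    alpha[i][j]:  Dirichlet pseudo-count for that transition (must be > 0).
    """
    total = 0.0
    for n_row, a_row in zip(counts, alpha):
        # Normalising terms of the Dirichlet-multinomial for this row.
        total += lgamma(sum(a_row)) - lgamma(sum(a_row) + sum(n_row))
        for n, a in zip(n_row, a_row):
            total += lgamma(n + a) - lgamma(a)
    return total


def prior_from_hypothesis(hypothesis, k):
    """Hypothetical, simplified elicitation: uniform prior plus k * belief."""
    return [[1.0 + k * p for p in row] for row in hypothesis]


# Toy transition counts between three states (illustrative data only).
counts = [[0, 9, 1], [8, 0, 2], [5, 5, 0]]

# One hypothesis that matches the data well, one that believes in nothing specific.
h_matching = [[0.0, 0.9, 0.1], [0.8, 0.0, 0.2], [0.5, 0.5, 0.0]]
h_uniform = [[1 / 3, 1 / 3, 1 / 3] for _ in range(3)]

k = 10  # strength of belief in the hypothesis
ev_matching = log_evidence(counts, prior_from_hypothesis(h_matching, k))
ev_uniform = log_evidence(counts, prior_from_hypothesis(h_uniform, k))

# A positive log Bayes factor favours the first hypothesis.
log_bayes_factor = ev_matching - ev_uniform
print(log_bayes_factor)
```

Working in log space with `lgamma` avoids overflow for realistic count sizes. Ranking more than two hypotheses would amount to sorting them by their evidence for a given `k`.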
Proceedings of the 6th Workshop on Recommender Systems and the Social Web (RSWeb 2014) co-located with the 8th ACM Conference on Recommender Systems (RecSys 2014), Foster City, CA, USA, October 6, 2014. Jannach, Dietmar; Freyne, Jill; Geyer, Werner; Guy, Ido; Hotho, Andreas; Mobasher, Bamshad in CEUR Workshop Proceedings (2014). (Vol. 1271) CEUR-WS.org.
The social distributional hypothesis: a pragmatic proxy for homophily in online social networks. Mitzlaff, Folke; Atzmueller, Martin; Hotho, Andreas; Stumme, Gerd in Social Network Analysis and Mining (2014). 4(1)
Applications of the Social Web are ubiquitous and have become an integral part of everyday life: Users make friends, for example, with the help of online social networks, share thoughts via Twitter, or collaboratively write articles in Wikipedia. All such interactions leave digital traces; thus, users participate in the creation of heterogeneous, distributed, collaborative data collections. In linguistics, the
Folksonomies. Singer, Philipp; Niebler, Thomas; Hotho, Andreas; Strohmaier, Markus in Encyclopedia of Social Network Analysis and Mining (2014). 542--547.
Sensor data is objective. But when measuring our environment, measured values are contrasted with our perception, which is always subjective. This makes interpreting sensor measurements difficult for a single person in her personal environment. In this context, the EveryAware project directly connects the concepts of objective sensor data with subjective impressions and perceptions by providing a collective sensing platform with several client applications that allow explicitly associating those two data types. The goal is to provide the user with personalized feedback, a characterization of the global as well as her personal environment, and enable her to position her perceptions in this global context. In this poster we summarize the collected data of two EveryAware applications, namely WideNoise for noise measurements and AirProbe for participatory air quality sensing. Basic insights are presented, including user activity, learning processes and correlations between sensor data and perceptions. These results provide an outlook on how this data can further be used to understand the connection between sensor data and perceptions.
How Tagging Pragmatics Influence Tag Sense Discovery in Social Annotation Systems. Niebler, Thomas; Singer, Philipp; Benz, Dominik; Körner, Christian; Strohmaier, Markus; Hotho, Andreas in Advances in Information Retrieval, P. Serdyukov, Braslavski, P., Kuznetsov, S. O., Kamps, J., Rüger, S., Agichtein, E., Segalovich, I., Yilmaz, E. (eds.) (2013). (Vol. 7814) 86-97.
The presence of emergent semantics in social annotation systems has been reported in numerous studies. Two important problems in this context are the induction of semantic relations among tags and the discovery of different senses of a given tag. While a number of approaches for discovering tag senses exist, little is known about which
A Generic Platform for Ubiquitous and Subjective Data. Becker, Martin; Mueller, Juergen; Hotho, Andreas; Stumme, Gerd (2013). New York, NY, USA.
An increasing number of platforms like Xively or ThingSpeak are available to manage ubiquitous sensor data enabling the Internet of Things. Strict data formats allow interoperability and informative visualizations, supporting the development of custom user applications. Yet, these strict data formats as well as the common feed-centric approach limit the flexibility of these platforms. We aim at providing a concept that supports data ranging from text-based formats like JSON to images and video footage. Furthermore, we introduce the concept of extensions, which allow enriching existing data points with additional information, thus taking a data-point-centric approach. This enables us to gain semantic and user-specific context by attaching subjective data to objective values. This paper provides an overview of our architecture including concept, implementation details and present applications. We distinguish our approach from several other systems and describe two sensing applications, namely AirProbe and WideNoise, that were implemented for our platform.
Semantics of User Interaction in Social Media. Mitzlaff, Folke; Atzmueller, Martin; Stumme, Gerd; Hotho, Andreas in Complex Networks IV, G. Ghoshal, Poncela-Casasnovas, J., Tolksdorf, R. (eds.) (2013). (Vol. 476)
Deeper Into the Folksonomy Graph: FolkRank Adaptations and Extensions for Improved Tag Recommendations. Landia, Nikolas; Doerfel, Stephan; Jäschke, Robert; Anand, Sarabjot Singh; Hotho, Andreas; Griffiths, Nathan in cs.IR (2013). 1310.1498
The information contained in social tagging systems is often modelled as a graph of connections between users, items and tags. Recommendation algorithms such as FolkRank have the potential to leverage complex relationships in the data, corresponding to multiple hops in the graph. We present an in-depth analysis and evaluation of graph models for social tagging data and propose novel adaptations and extensions of FolkRank to improve tag recommendations. We highlight implicit assumptions made by the widely used folksonomy model, and propose an alternative and more accurate graph-representation of the data. Our extensions of FolkRank address the new item problem by incorporating content data into the algorithm, and significantly improve prediction results on unpruned datasets. Our adaptations address issues in the iterative weight spreading calculation that potentially hinder FolkRank's ability to leverage the deep graph as an information source. Moreover, we evaluate the benefit of considering each deeper level of the graph, and present important insights regarding the characteristics of social tagging data in general. Our results suggest that the base assumption made by conventional weight propagation methods, that closeness in the graph always implies a positive relationship, does not hold for the social tagging domain.
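To make the notion of iterative weight spreading concrete, the following sketch shows a simplified differential ranking in the spirit of the basic FolkRank idea: weights are spread over an undirected user-tag-resource graph once with a uniform preference vector and once with a preference biased towards a query user and resource, and the difference is taken as the score. It is not the adapted algorithm proposed in the paper. The function names, the co-occurrence edge weights, the damping value `d=0.7`, the boost size and the toy posts are all assumptions made for illustration.

```python
def build_graph(posts):
    """Undirected graph from (user, tag, resource) assignments.

    Edge weights count co-occurrences (an assumption of this sketch).
    """
    adj = {}
    for user, tag, resource in posts:
        nodes = ("u:" + user, "t:" + tag, "r:" + resource)
        for a in nodes:
            for b in nodes:
                if a != b:
                    adj.setdefault(a, {})
                    adj[a][b] = adj[a].get(b, 0) + 1
    return adj


def spread_weights(adj, pref, d=0.7, iters=100):
    """Iterate w <- d * A w + (1 - d) * pref until (approximately) stable."""
    w = {v: 1.0 / len(adj) for v in adj}
    for _ in range(iters):
        new_w = {v: (1.0 - d) * pref[v] for v in adj}
        for u, neighbours in adj.items():
            degree = sum(neighbours.values())
            for v, weight in neighbours.items():
                # u passes its weight on in proportion to the edge weight.
                new_w[v] += d * w[u] * weight / degree
        w = new_w
    return w


def folkrank(adj, preferred, d=0.7):
    """Difference between preference-biased and unbiased weight spreading."""
    n = len(adj)
    uniform = {v: 1.0 / n for v in adj}
    raw = {v: 1.0 + (n if v in preferred else 0.0) for v in adj}
    total = sum(raw.values())
    biased = {v: x / total for v, x in raw.items()}
    baseline = spread_weights(adj, uniform, d)
    personalised = spread_weights(adj, biased, d)
    return {v: personalised[v] - baseline[v] for v in adj}


# Toy folksonomy: two users, each tagging one resource (illustrative data).
posts = [("alice", "python", "doc1"), ("bob", "java", "doc2")]
graph = build_graph(posts)
scores = folkrank(graph, preferred={"u:alice", "r:doc1"})

# Recommend tags for alice and doc1, ranked by differential score.
tags = sorted((v for v in scores if v.startswith("t:")), key=scores.get, reverse=True)
print(tags)
```

Because weight only ever flows along edges, every node close to the preferred nodes gains score. This is the kind of closeness-implies-relevance assumption that the abstract reports as questionable for social tagging data.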
Awareness and Learning in Participatory Noise Sensing. Becker, Martin; Caminiti, Saverio; Fiorella, Donato; Francis, Louise; Gravino, Pietro; Haklay, Mordechai (Muki); Hotho, Andreas; Loreto, Vittorio; Mueller, Juergen; Ricchiuti, Ferdinando; Servedio, Vito D. P.; Sîrbu, Alina; Tria, Francesca in PLoS ONE (2013). 8(12) e81638.
The development of ICT infrastructures has facilitated the emergence of new paradigms for looking at society and the environment over the last few years. Participatory environmental sensing, i.e. directly involving citizens in environmental monitoring, is one example, which is hoped to encourage learning and enhance awareness of environmental issues. In this paper, an analysis of the behaviour of individuals involved in noise sensing is presented. Citizens have been involved in noise measuring activities through the WideNoise smartphone application. This application has been designed to record both objective (noise samples) and subjective (opinions, feelings) data. The application has been open to be used freely by anyone and has been widely employed worldwide. In addition, several test cases have been organised in European countries. Based on the information submitted by users, an analysis of emerging awareness and learning is performed. The data show that changes in the way the environment is perceived after repeated usage of the application do appear. Specifically, users learn how to recognise different noise levels they are exposed to. Additionally, the subjective data collected indicate an increased user involvement in time and a categorisation effect between pleasant and less pleasant environments.
Tag Recommendations for SensorFolkSonomies. Mueller, Juergen; Doerfel, Stephan; Becker, Martin; Hotho, Andreas; Stumme, Gerd (2013). (Vol. 1066)
With the rising popularity of smart mobile devices, sensor data-based applications have become more and more popular. Their users record data during their daily routine or specifically for certain events. The application WideNoise Plus allows users to record sound samples and to annotate them with perceptions and tags. The app is being used to document and map the soundscape all over the world. The procedure of recording, including the assignment of tags, has to be as easy-to-use as possible. We therefore discuss the application of tag recommender algorithms in this particular scenario. We show that this task is fundamentally different from the well-known tag recommendation problem in folksonomies, as users no longer tag fixed resources but rather sensory data and impressions. The scenario requires efficient recommender algorithms that are able to run on the mobile device, since Internet connectivity cannot be assumed to be available. Therefore, we evaluate the performance of several tag recommendation algorithms and discuss their applicability in the mobile sensing use-case.
Ubiquitous Social Media Analysis Third International Workshops, MUSE 2012, Bristol, UK, September 24, 2012, and MSM 2012, Milwaukee, WI, USA, June 25, 2012, Revised Selected Papers. Atzmueller, Martin; Chin, Alvin; Helic, Denis; Hotho, Andreas (2013). Imprint: Springer, Berlin, Heidelberg.
Proceedings of the Fifth ACM RecSys Workshop on Recommender Systems and the Social Web co-located with the 7th ACM Conference on Recommender Systems (RecSys 2013), Hong Kong, China, October 13, 2013. Mobasher, Bamshad; Jannach, Dietmar; Geyer, Werner; Freyne, Jill; Hotho, Andreas; Anand, Sarabjot Singh; Guy, Ido in CEUR Workshop Proceedings (2013). (Vol. 1066) CEUR-WS.org.
Exploiting Structural Consistencies with Stacked Conditional Random Fields. Kluegl, Peter; Toepfer, Martin; Lemmerich, Florian; Hotho, Andreas; Puppe, Frank in Mathematical Methodologies in Pattern Recognition and Machine Learning Springer Proceedings in Mathematics & Statistics (2013). 30 111-125.
Conditional Random Fields (CRF) are popular methods for labeling unstructured or textual data. Like many machine learning approaches, these undirected graphical models assume the instances to be independently distributed. However, in real-world applications data is grouped in a natural way, e.g., by its creation context. The instances in each group often share additional structural consistencies. This paper proposes a domain-independent method for exploiting these consistencies by combining two CRFs in a stacked learning framework. We apply rule learning collectively on the predictions of an initial CRF for one context to acquire descriptions of its specific properties. Then, we utilize these descriptions as dynamic and high quality features in an additional (stacked) CRF. The presented approach is evaluated with a real-world dataset for the segmentation of references and achieves a significant reduction of the labeling error.
Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia. Singer, Philipp; Niebler, Thomas; Strohmaier, Markus; Hotho, Andreas in International Journal on Semantic Web and Information Systems (IJSWIS) (2013). 9(4) 41--70.
Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia
In this article, the authors present a novel approach for computing semantic relatedness and conduct a large-scale study of it on Wikipedia. Unlike existing semantic analysis methods that utilize Wikipedia’s content or link structure, the authors propose to use human navigational paths on Wikipedia for this task. The authors obtain 1.8 million human navigational paths from a semi-controlled navigation experiment – a Wikipedia-based navigation game, in which users are required to find short paths between two articles in a given Wikipedia article network. The authors’ results are intriguing: They suggest that (i) semantic relatedness computed from human navigational paths may be more precise than semantic relatedness computed from Wikipedia’s plain link structure alone and (ii) that not all navigational paths are equally useful. Intelligent selection based on path characteristics can improve accuracy. The authors’ work makes an argument for expanding the existing arsenal of data sources for calculating semantic relatedness and to consider the utility of human navigational paths for this task.
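The abstract does not spell out how relatedness is derived from the paths. As a rough illustration only, the sketch below assumes a generic co-occurrence approach: each article is represented by counts of the articles that appear next to it in navigation paths, and relatedness is the cosine of two such count vectors. The function names, the window size, and the toy paths are hypothetical and are not taken from the article.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(paths, window=1):
    """Count, for every article, which articles occur within `window` steps of it."""
    vectors = defaultdict(Counter)
    for path in paths:
        for i, page in enumerate(path):
            lo = max(0, i - window)
            hi = min(len(path), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[page][path[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy navigation paths (made up for illustration).
paths = [
    ["Cat", "Mammal", "Dog"],
    ["Cat", "Mammal", "Whale"],
    ["Dog", "Mammal", "Animal"],
    ["Car", "Engine", "Fuel"],
]
vectors = cooccurrence_vectors(paths)
related = cosine(vectors["Cat"], vectors["Dog"])    # share the neighbour "Mammal"
unrelated = cosine(vectors["Cat"], vectors["Car"])  # no shared neighbours
```

Selecting paths by their characteristics, as the abstract suggests, would correspond to filtering `paths` before the vectors are built.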
Recommender Systems for Social Tagging Systems. Balby Marinho, L.; Hotho, A.; Jäschke, R.; Nanopoulos, A.; Rendle, S.; Schmidt-Thieme, L.; Stumme, G.; Symeonidis, P. in SpringerBriefs in Electrical and Computer Engineering (2012). Springer.
Social Tagging Systems are web applications in which users upload resources (e.g., bookmarks, videos, photos) and annotate them with a list of freely chosen keywords called tags. This is a grassroots approach to organizing a site and helping users find the resources they are interested in. Social tagging systems are open and inherently social, features that have been proven to encourage participation. However, with the large popularity of these systems and the increasing amount of user-contributed content, information overload rapidly becomes an issue. Recommender Systems are well-known applications for increasing the level of relevant content over the “noise” that continuously grows as more and more content becomes available online. In social tagging systems, however, we face new challenges. While in classic recommender systems the mode of recommendation is basically the resource, in social tagging systems there are three possible modes of recommendation: users, resources, or tags. Therefore, suitable methods that properly exploit the different dimensions of social tagging systems data are needed. In this book, we survey the most recent and state-of-the-art work about a whole new generation of recommender systems built to serve social tagging systems. The book is divided into self-contained chapters that range from background material on social tagging systems and recommender systems to more advanced techniques such as those based on tensor factorization and graph-based models.
4th ACM RecSys workshop on recommender systems and the social web. Mobasher, Bamshad; Jannach, Dietmar; Geyer, Werner; Hotho, Andreas. P. Cunningham, Hurley, N. J., Guy, I., Anand, S. S. (eds.) (2012). 345-346.
Stacked Conditional Random Fields Exploiting Structural Consistencies. Klügl, Peter; Toepfer, Martin; Lemmerich, Florian; Hotho, Andreas; Puppe, Frank. P. L. Carmona, Sánchez, J. S., Fred, A. (eds.) (2012). 240-248.
Stacked Conditional Random Fields Exploiting Structural Consistencies
Conditional Random Fields (CRF) are popular methods for labeling unstructured or textual data. Like many machine learning approaches, these undirected graphical models assume the instances to be independently distributed. However, in real-world applications data is grouped in a natural way, e.g., by its creation context. The instances in each group often share additional structural consistencies. This paper proposes a domain-independent method for exploiting these consistencies by combining two CRFs in a stacked learning framework. The approach incorporates three successive steps of inference: First, an initial CRF processes single instances as usual. Next, we apply rule learning collectively on all labeled outputs of one context to acquire descriptions of its specific properties. Finally, we utilize these descriptions as dynamic and high-quality features in an additional stacked CRF. The presented approach is evaluated with a real-world dataset for the segmentation of references and achieves a significant reduction of the labeling error.
Further information
Editors
Carmona, Pedro Latorre and Sánchez, J. Salvador and Fred, Ana
RSWeb '12: Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web. Mobasher, Bamshad; Jannach, Dietmar; Geyer, Werner; Hotho, Andreas (2012). ACM, Dublin, Ireland.
RSWeb '12: Proceedings of the 4th ACM RecSys workshop on Recommender systems and the social web
The new opportunities for applying recommendation techniques within Social Web platforms and applications as well as the various new sources of information which have become available in the Web 2.0 and can be incorporated in future recommender applications are a strong driving factor in current recommender system research for various reasons:
(1) Social systems by their definition encourage interaction between users and both online content and other users, thus generating new sources of knowledge for recommender systems. Web 2.0 users explicitly provide personal information and implicitly express preferences through their interactions with others and the system (e.g. commenting, friending, rating, etc.). These various new sources of knowledge can be leveraged to improve recommendation techniques and develop new strategies which focus on social recommendation.
(2) New application areas for recommender systems emerge with the popularity of the Social Web. Recommenders can not only be used to sort and filter Web 2.0 and social network information, but can also support users in the information sharing process, e.g., by recommending suitable tags during folksonomy development.
(3) Recommender technology can assist Social Web systems through increasing adoption and participation and sustaining membership. Through targeted and timely intervention which stimulates traffic and interaction, recommender technology can play its role in sustaining the success of the Social Web.
(4) The Social Web also presents new challenges for recommender systems, such as the complicated nature of human-to-human interaction which comes into play when recommending people and can require more interactive and richer recommender systems user interfaces.
The technical papers appearing in these proceedings aim to explore and understand challenges and new opportunities for recommender systems in the Social Web and were selected in a formal review process by an international program committee.
Overall, we received 13 paper submissions from 12 different countries, out of which 7 long papers and 1 short paper were selected for presentation and inclusion in the proceedings. The submitted papers addressed a variety of topics related to Social Web recommender systems, ranging from the use of microblogging data for personalization and new tag recommendation approaches to social media-based personalization of news.
Challenges in Tag Recommendations for Collaborative Tagging Systems. Jäschke, Robert; Hotho, Andreas; Mitzlaff, Folke; Stumme, Gerd in Recommender Systems for the Social Web, J. J. Pazos Arias, Fernández Vilas, A., Díaz Redondo, R. P. (eds.) (2012). (Vol. 32) 65--87.
Challenges in Tag Recommendations for Collaborative Tagging Systems
Originally introduced by social bookmarking systems, collaborative tagging, or social tagging, has been widely adopted by many web-based systems like wikis, e-commerce platforms, or social networks. Collaborative tagging systems allow users to annotate resources using freely chosen keywords, so-called tags. Those tags help users in finding/retrieving resources, discovering new resources, and navigating through the system. The process of tagging resources is laborious. Therefore, most systems support their users by tag recommender components that recommend tags in a personalized way. The Discovery Challenges 2008 and 2009 of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD) tackled the problem of tag recommendations in collaborative tagging systems. Researchers were invited to test their methods in a competition on datasets from the social bookmark and publication sharing system BibSonomy. Moreover, the 2009 challenge included an online task where the recommender systems were integrated into BibSonomy and provided recommendations in real time. In this chapter we review, evaluate and summarize the submissions to the two Discovery Challenges and thus lay the groundwork for continuing research in this area.
Further information
Editors
Pazos Arias, José J. and Fernández Vilas, Ana and Díaz Redondo, Rebeca P.
Leveraging publication metadata and social data into FolkRank for scientific publication recommendation. Doerfel, Stephan; Jäschke, Robert; Hotho, Andreas; Stumme, Gerd in RSWeb '12 (2012). 9--16.
Leveraging publication metadata and social data into FolkRank for scientific publication recommendation
The ever-growing flood of new scientific articles requires novel retrieval mechanisms. One means of mitigating this instance of the information overload phenomenon is collaborative tagging systems, which allow users to select, share and annotate references to publications. These systems employ recommendation algorithms to present to their users personalized lists of interesting and relevant publications. In this paper we analyze different ways to incorporate social data and metadata from collaborative tagging systems into the graph-based ranking algorithm FolkRank to utilize it for recommending scientific articles to users of the social bookmarking system BibSonomy. We compare the results to those of Collaborative Filtering, which has previously been applied for resource recommendation.
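For readers unfamiliar with FolkRank, the following is a minimal sketch of the basic differential ranking idea on a plain user-tag-resource graph. It does not include the social data or metadata extensions studied in the paper. The damping factor, the size of the preference boost, the iteration count, and the toy posts are illustrative assumptions, and all function names are hypothetical.

```python
from collections import defaultdict


def build_graph(posts):
    """Undirected weighted graph over users, tags and resources.

    posts: iterable of (user, tag, resource) tag assignments.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for user, tag, resource in posts:
        nodes = [("u", user), ("t", tag), ("r", resource)]
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a][b] += 1.0
    return graph


def weight_spreading(graph, preference, damping=0.7, iterations=50):
    """PageRank-style weight spreading with a preference (teleport) vector."""
    nodes = list(graph)
    total = sum(preference.get(n, 0.0) for n in nodes)
    pref = {n: preference.get(n, 0.0) / total for n in nodes}
    weights = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) * pref[n] for n in nodes}
        for n in nodes:
            degree = sum(graph[n].values())
            for m, w in graph[n].items():
                new[m] += damping * weights[n] * w / degree
        weights = new
    return weights


def folkrank(graph, user, boost=None):
    """Differential ranking: personalized weights minus global weights."""
    nodes = list(graph)
    uniform = {n: 1.0 for n in nodes}
    boosted = dict(uniform)
    # Assumed boost: extra preference for the target user (illustrative value).
    boosted[("u", user)] += len(nodes) if boost is None else boost
    base = weight_spreading(graph, uniform)
    personalized = weight_spreading(graph, boosted)
    return {n: personalized[n] - base[n] for n in nodes}


# Toy tag assignments (made up for illustration).
posts = [
    ("alice", "python", "r1"),
    ("alice", "ml", "r2"),
    ("bob", "ml", "r2"),
    ("bob", "java", "r3"),
]
scores = folkrank(build_graph(posts), "alice")
# Resources ordered by their differential score for alice.
ranked = sorted((n for n in scores if n[0] == "r"), key=scores.get, reverse=True)
```

In a recommendation setting, resources the user has already posted would typically be filtered out of `ranked` before presenting the list.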
Publikationen im Web 2.0. Hotho, Andreas in Informatik-Spektrum (2012). 1-5.
The connection of ubiquitous and social computing is an emerging research area that combines two prominent areas of computer science. In this paper, we tackle this topic from different angles: We describe data mining methods for ubiquitous and social data, specifically focusing on physical and social activities, and provide exemplary analysis results. Furthermore, we give an overview of the Ubicon platform, which provides a framework for the creation and hosting of ubiquitous and social applications for diverse tasks and projects. Ubicon features the collection and analysis of both physical and social activities of users for enabling inter-connected applications in ubiquitous and social contexts. We summarize three real-world systems built on top of Ubicon and discuss the corresponding mining and analysis aspects by example.
The challenge of recommender systems challenges. Said, Alan; Tikk, Domonkos; Hotho, Andreas. P. Cunningham, Hurley, N. J., Guy, I., Anand, S. S. (eds.) (2012). 9-10.
Real-world tagging datasets have a large proportion of new/untagged documents. Few approaches for recommending tags to a user for a document address this new item problem, concentrating instead on artificially created post-core datasets where it is guaranteed that the user as well as the document of each test post is known to the system and already has some tags assigned to it. In order to recommend tags for new documents, approaches are required which model documents not only based on the tags assigned to them in the past (if any), but also on their content. In this paper we present a novel adaptation of the widely recognised FolkRank tag recommendation algorithm that includes content data. We adapt the FolkRank graph to use word nodes instead of document nodes, enabling it to recommend tags for new documents based on their textual content. Our adaptations make FolkRank applicable to post-core 1, i.e., the full real-world tagging datasets, and address the new item problem in tag recommendation. For comparison, we also apply and evaluate the same methodology of including content on a simpler tag recommendation algorithm. This results in a less expensive recommender which suggests a combination of user-related and document-content-related tags.
Including content data into FolkRank shows an improvement over plain FolkRank on full tagging datasets. However, we also observe that our simpler content-aware tag recommender outperforms FolkRank with content data. Our results suggest that an optimisation of the weighting method of FolkRank is required to achieve better results.
Collective Information Extraction with Context-Specific Consistencies. Klügl, Peter; Toepfer, Martin; Lemmerich, Florian; Hotho, Andreas; Puppe, Frank in Lecture Notes in Computer Science, P. A. Flach, Bie, T. D., Cristianini, N. (eds.) (2012). (Vol. 7523) 728-743.
Datenschutz im Web 2.0 am Beispiel des sozialen Tagging-Systems BibSonomy. Krause, Beate; Lerch, Hana; Hotho, Andreas; Roßnagel, Alexander; Stumme, Gerd in Informatik Spektrum (2012). 35(1) 12-23.
Face-to-Face Contacts at a Conference: Dynamics of Communities and Roles. Atzmueller, Martin; Doerfel, Stephan; Hotho, Andreas; Mitzlaff, Folke; Stumme, Gerd in Modeling and Mining Ubiquitous Social Media (2012). (Vol. 7472)
Proceedings of the Third International Workshop on Mining Ubiquitous and Social Environments (MUSE 2012). Atzmueller, Martin; Hotho, Andreas (2012). Workshop Notes, Bristol, UK.