Data Science Chair

    Towards Predicting the Subscription Status of Twitch.tv Users — ECML-PKDD ChAT Discovery Challenge 2020


    Together with researchers from Leipzig University and Bauhaus University Weimar, we organized this year's ECML-PKDD discovery challenge "ChAT", short for "Chat Analytics for Twitch".

    Twitch.tv is a live-streaming video platform, mainly used by gamers to share their gameplay and audio commentary with the community. Viewers can interact with the streamer and with each other by writing messages in the chat. We already worked with Twitch chat messages in our "Emote-Controlled" paper from earlier this year, which provides more information on Twitch and its characteristics. On Twitch, users can support their favorite streamers by subscribing to their channel. A subscription costs a monthly fee and unlocks subscriber-only features, such as exclusive emotes or access to certain chat rooms. Our main research question for the "ChAT" discovery challenge was: "Is there a difference in chat and interaction behavior between subscribed and non-subscribed users?" To investigate this, we asked the challenge participants to build a binary classifier that predicts the subscription status of user-channel combinations based on the chat messages the users wrote in the channel. A well-performing classifier would make it possible to identify users who are currently not subscribed to a channel but behave like subscribers. Such users could then be targeted with advertising.

    For the challenge, we provided more than 400 000 000 public Twitch chat messages from English channels as a training set, along with additional metadata such as the timestamp or the game that was being streamed when the message was written. We evaluated the participants' models on 90 000 unseen user-channel combinations and analyzed their predictions to better understand the strengths and weaknesses of their approaches. Participating teams extracted stylometric features, user and channel activity features, and user-channel interaction features from the data and applied different models (e.g., LSTMs, XGBoost, or CatBoost) to predict the subscription status.
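
    To give a rough idea of what "stylometric features" for a user-channel combination can look like, here is a minimal Python sketch. It is not taken from any participant's submission: the function names, the chosen statistics, and the (user, channel, text) row layout are assumptions made purely for illustration, and the toy rows do not reflect the actual schema of the challenge data.

```python
from collections import defaultdict
from statistics import mean


def stylometric_features(messages):
    """Compute a few simple style statistics for one user-channel pair.

    Illustrative feature set only; not the features used by the teams.
    """
    if not messages:
        raise ValueError("need at least one message")
    tokens_per_message = [len(m.split()) for m in messages]
    letters = [c for m in messages for c in m if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "n_messages": len(messages),
        "mean_chars": mean(len(m) for m in messages),
        "mean_tokens": mean(tokens_per_message),
        # Share of uppercase letters among all letters (0.0 if no letters).
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


def group_by_user_channel(rows):
    """Group (user, channel, text) rows into one message list per pair."""
    grouped = defaultdict(list)
    for user, channel, text in rows:
        grouped[(user, channel)].append(text)
    return dict(grouped)


# Toy rows (hypothetical layout, not the challenge's data format).
rows = [
    ("alice", "chan1", "GG well played"),
    ("alice", "chan1", "lol"),
    ("bob", "chan1", "hello"),
]
features = {
    pair: stylometric_features(msgs)
    for pair, msgs in group_by_user_channel(rows).items()
}
```

    Each resulting feature dictionary describes one user-channel combination and could, together with a subscribed/non-subscribed label, serve as one training example for a binary classifier such as a gradient-boosted tree model.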

    The challenge started in March 2020 and the model submission deadline was in June. Of the 23 registered teams, four submitted a working model and three provided a descriptive paper. If you are interested in the task and data, the participants' approaches, or our analysis of the results, take a look at the challenge's website and the proceedings published at CEUR-WS. If you are attending this year's (virtual) ECML-PKDD conference next week (14-18.09.2020), you are welcome to join our session on Friday, 18.09.2020, at 15:00 CEST, which will feature talks and Q&As by all teams.