Q&A
Below are answers to frequently asked questions about the MMWAH corpus. If you cannot find an answer to your question, please contact us directly at martti.makinen@hanken.fi or ines.frojdo@hanken.fi.
What is MMWAH?
MMWAH stands for the Multilingual Multimodal WhatsApp corpus Hanken. It is a curated text collection consisting of WhatsApp chats conducted among Finland-Swedes aged 18 to 30.
The chats have been voluntarily donated to the corpus in connection with the language research project Instant Messaging in Multiple Languages: focus on WhatsApp in Finland-Swedish digital communication.
What is a corpus?
Corpora are collections of texts created for the purpose of research according to certain criteria, such as text type, language, or genre. They form the basis of the majority of modern linguistic research.
Corpora are also used in computer science for natural language processing (NLP), where they serve as training data for the large language models behind ChatGPT and similar applications.
Who is on the research team?
Martti Mäkinen is the project manager and Ines Fröjdö is the research assistant. In the first period of the project, Leyla Shojaeifard also worked with us on the technical solutions and data management.
Why study Finland-Swedish digital communication?
As speakers of a minority language, Finland-Swedes are able to use multiple languages to navigate Finland-Swedish society. In MMWAH, the linguistic skills of the speakers are reflected, for example, in switching between languages.
Digital communication on platforms such as WhatsApp combines features of written and spoken language use. In traditional contexts, Finland-Swedes tend to follow the rules of standard written Swedish, and the unique Finland-Swedish features are often lost. However, we often retain these features in our less formal everyday conversations. The multimodal tools used on communication platforms such as WhatsApp, for example emojis, audio messages or memes, distinguish this form of language from the other documented variants of Finland-Swedish language use.
What is the goal?
MMWAH will make it possible to map how Swedish, Finnish and English are naturally mixed in everyday use. The coexistence of languages and everyday code-switching between them is a topical issue in linguistics. The corpus will also capture changes in stylistic phenomena in digital language environments, such as the use of punctuation and emojis.
In short, we are creating material for research on Finland-Swedish identity that will be available to other researchers according to the principles of Open Science. In this way, linguistic changes and phenomena specific to digital communication among Finland-Swedes are captured.
How do I contribute?
- Donate a chat
  - Open the chat in the WhatsApp app
  - Click Settings > More > Export Chat > Include media
  - Email the material to mmwah@hanken.fi
- Consent to research participation through the form
- Answer a short questionnaire on linguistic background (approx. 5 min.)
- Edit the donation (if need be)
The research team will anonymise the donated material once informed consent has been collected from each chat participant. The participants may remove data that they do not want to be included in the research from the donated material before the processing of the data has begun.
Why should I participate?
Data on the use of Swedish in Finland is needed to map and, above all, record the language as it is currently used. The language is changing rapidly and without research material it is not possible to study the changes or trends in the language. To create as complete a representation of the language as possible, we need to reach many different language users from different backgrounds.
Is my data good to donate?
Short answer: Yes!
As long as at least one of the chat participants identifies as a Swedish speaker, the data is of interest to us. Note that the discussion does not need to take place in Swedish: we are interested in Swedish speakers' use of linguistic resources in digital communication in Finland, irrespective of the language.
As for the chats, both long and short discussions provide valid data (the minimum is 20 messages per chat). You do not need to worry about the content, as all language use is of interest to MMWAH. Language research focuses on how thoughts and ideas are expressed; the content can really be anything. The chats can deal with everyday, mundane things, as that is exactly what we wish to see: simple, daily language use.
The donated chat can include multimodal elements, e.g. pictures, videos, or voice recordings. They will be anonymised in a similar manner as the text of the chat.
What kind of chats can I donate?
Friend chats, group chats, sports team chats and the like are all suitable for the MMWAH corpus. As long as we can contact all the individual chat participants for informed consent, you can donate whichever chat you like. The number of chat participants can be anything between 2 and 20 people. Please check with the other participants before you donate; that will increase the likelihood of a successful donation.
As said earlier, even though the focus of the project is Finland-Swedish language use, the (main) language of the chat does not have to be Swedish. Mixed-language chats are equally valuable. We welcome all languages and all mixes of languages, provided that we can carry out a secure anonymisation process on the text.
We collect first and foremost language data from persons between 18 and 30 years of age. Nevertheless, there can be chat participants outside this age group; that is not a problem. They must be at least 15 years of age to be able to give their informed consent without their parents' approval.
Can I be identified by the chat I donated?
Users of the finalised corpus will not be able to identify the donors of the material in the corpus. The content will be pseudonymised (personal names are replaced by code names) and anonymised (identifiable content is deleted). Once the individual chats have been processed and anonymised, they will be aggregated into the corpus. Donors should feel confident that their data cannot be linked back to them.
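For readers curious about what replacing personal names with code names can look like in practice, here is a minimal Python sketch. It is an illustration only and not the research team's actual procedure: the function name `pseudonymise`, the `P01`-style code names, and the idea of working from a known list of names are all assumptions made for this example.

```python
import re


def pseudonymise(messages, known_names):
    """Replace each known personal name with a stable code name.

    Hypothetical helper for illustration; real pseudonymisation also has to
    handle nicknames, misspellings, phone numbers, places and so on.
    """
    if not known_names:
        return list(messages), {}
    # Assign code names in sorted order so the mapping is reproducible.
    codes = {name: f"P{i:02d}" for i, name in enumerate(sorted(known_names), start=1)}
    # Try longer names first so "Anna-Maria" is not split into "Anna" + "-Maria".
    by_length = sorted(codes, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in by_length) + r")\b")
    replaced = [pattern.sub(lambda m: codes[m.group(0)], text) for text in messages]
    return replaced, codes


# Made-up sample messages, not taken from the corpus.
sample = ["Hej Anna, kommer Johan ikväll?", "Anna-Maria sa nej"]
texts, mapping = pseudonymise(sample, {"Anna", "Johan", "Anna-Maria"})
```

The same name always maps to the same code name, so conversations remain readable while the names themselves are removed.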
The research team will collect consent and essential background information from each of the people participating in donated chats. The background data will enable the corpus to be filterable, allowing users to search for messages by, for example, age group, geographical area or the speaker's native language. Participants remain anonymous in the corpus even when carefully selected metadata is published with the corpus.
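To give an idea of what searching by background information could look like, here is a small Python sketch. It assumes a hypothetical message record with `age_group`, `region` and `native_language` fields; the published corpus may organise its metadata differently, and the sample values below are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    speaker: str          # code name, never a real name
    age_group: str        # hypothetical label, e.g. "18-30"
    region: str           # hypothetical geographical area
    native_language: str
    text: str


def filter_messages(messages, **criteria):
    """Return the messages whose metadata equals every given criterion."""
    return [
        m for m in messages
        if all(getattr(m, field) == value for field, value in criteria.items())
    ]


# Made-up sample records for illustration only.
corpus = [
    Message("P01", "18-30", "Uusimaa", "Swedish", "ska vi ses idag?"),
    Message("P02", "18-30", "Ostrobothnia", "Swedish", "joo, sounds good"),
    Message("P03", "31-45", "Uusimaa", "Finnish", "okej"),
]
young_swedish = filter_messages(corpus, age_group="18-30", native_language="Swedish")
```

Only the code name and the selected background categories appear in each record, which is how a corpus can stay searchable without revealing who wrote a message.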
What happens if I change my mind about participating?
It is possible to withdraw your consent to participate in the project. If you change your mind, you can contact us and ask us to delete the material you donated, or the instances where you are the author of the messages. The questionnaires and contact details we have collected from you will be deleted at the same time.
If you wish to withdraw your consent, please contact us by email.
About gift cards
Gift cards worth 20 euros are given to chat participants as thanks for their trouble. After we have received the chat donation, the gift cards are sent out manually, with a delay of a few days, to the email address given in the consent form. Only chat participants who choose to take part, and thereby give their consent, receive a gift card. The number of gift cards is limited to one per person, even if you choose to donate several chats.
Gift cards are available for the following stores: Gigantti, Google Play, Ruohonjuuri, Suomalainen kirjakauppa, Verkkokauppa, XXL, and Zalando.