Analytics Datasets: Commons Impact Metrics

Overview

This collection of datasets details how Commons media is edited, used, and accessed across Wikimedia projects. Currently, we publish data only about categories on an allow-list, curated jointly with the GLAM community.

The files available for download are all in TSV (tab-separated values) format. Fields that hold lists use the "|" character to separate items. Schemas for each dataset are detailed below.
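
As an illustration of the format described above, the following is a minimal Python sketch for reading such a file, not an official loader. It assumes that the column order in the file matches the schema order and that no header row is present; neither is stated on this page, so check a downloaded file first. The function name `parse_rows` and the sample values are hypothetical.

```python
import csv
import io

LIST_SEPARATOR = "|"


def parse_rows(tsv_text, columns, list_columns):
    """Yield one dict per TSV line, splitting list-valued columns on "|"."""
    # QUOTE_NONE: treat quote characters in titles as ordinary text (assumption).
    reader = csv.reader(io.StringIO(tsv_text, newline=""),
                        delimiter="\t", quoting=csv.QUOTE_NONE)
    for values in reader:
        row = dict(zip(columns, values))
        for name in list_columns:
            raw = row.get(name, "")
            row[name] = raw.split(LIST_SEPARATOR) if raw else []
        yield row


# Hypothetical sample line shaped like the "Media file metrics snapshot" schema.
sample = "Example_file.jpg\tBITMAP\tCat_A|Cat_B\tExample_Museum\t3\t12\t2024-03\n"
columns = ["media_file", "media_type", "categories", "primary_categories",
           "leveraging_wiki_count", "leveraging_page_count", "month"]
rows = list(parse_rows(sample, columns, ["categories", "primary_categories"]))
print(rows[0]["categories"])  # ['Cat_A', 'Cat_B']
```

All values other than the list columns stay as strings here; convert counts with `int()` where needed.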

Download Commons Impact Metrics Data

Category metrics snapshot

Field Description
category The name of the category this row refers to. Coincides with the page title of the category page in Commons. URL version (with underscores).
primary_categories The top ancestor category names of this row’s category. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but because we cannot control that from MediaWiki, we accept multiple primary categories. The list is separated by the bar “|” character.
media_file_count The number of media files contained in this (shallow) category.
media_file_count_deep The number of media files contained in this (deep) category tree.
used_media_file_count The number of media files from this (shallow) category featured in at least one wiki page.
used_media_file_count_deep The number of media files from this (deep) category tree featured in at least one wiki page.
leveraging_wiki_count The number of wikis featuring at least one of this (shallow) category’s media files.
leveraging_wiki_count_deep The number of wikis featuring at least one of this (deep) category tree’s media files.
leveraging_page_count The number of (namespace=0) pages featuring at least one of this (shallow) category’s media files.
leveraging_page_count_deep The number of (namespace=0) pages featuring at least one of this (deep) category tree’s media files.
month The month after whose end the data is calculated (YYYY-MM). For example, if the data is calculated after March 2024 (even if the calculation runs on, e.g., April 4th), the value is “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits).
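
The month convention used by the snapshot datasets can be sketched as follows. This is only an illustration of the rule described in the month field; the function name `snapshot_month` is hypothetical and not part of any published tooling.

```python
from datetime import date


def snapshot_month(calculation_date):
    """Return the YYYY-MM label of the last month that ended before calculation_date."""
    year, month = calculation_date.year, calculation_date.month
    if month == 1:
        # January runs label the previous year's December.
        year, month = year - 1, 12
    else:
        month -= 1
    return f"{year:04d}-{month:02d}"


print(snapshot_month(date(2024, 4, 4)))   # 2024-03
print(snapshot_month(date(2024, 1, 15)))  # 2023-12
```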

Media file metrics snapshot

Field Description
media_file The name of the media file this row refers to. Coincides with the page title of the media file page in Commons. URL version (with underscores).
media_type The media type of the media file, coming from the Image table (img_media_type): BITMAP, VIDEO, etc.
categories The category names that the media file is directly associated with. The list is separated by the bar “|” character.
primary_categories The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but because we cannot control that from MediaWiki, we accept multiple primary categories. The list is separated by the bar “|” character.
leveraging_wiki_count The number of wikis featuring this media file in at least one (namespace=0) page.
leveraging_page_count The number of (namespace=0) pages featuring this media file across all wikis.
month The month after whose end the data is calculated (YYYY-MM). For example, if the data is calculated after March 2024 (even if the calculation runs on, e.g., April 4th), the value is “2024-03”. This keeps the field consistent with the sibling incremental datasets (Pageviews by category, Pageviews by media file, and Edits).
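
Because a media file may carry more than one primary category, per-institution tallies need to decide how to treat such files. The sketch below counts a file once under each of its primary categories. It assumes rows are dicts keyed by the field names above with the raw "|"-separated string still in place; the function name and sample values are hypothetical.

```python
from collections import Counter


def files_per_primary_category(rows):
    """Count media files under every primary category they belong to."""
    counts = Counter()
    for row in rows:
        # set() guards against a primary category being listed twice.
        for primary in set(row["primary_categories"].split("|")):
            if primary:
                counts[primary] += 1
    return counts


# Hypothetical rows from the media file metrics snapshot.
rows = [
    {"media_file": "A.jpg", "primary_categories": "Example_Museum"},
    {"media_file": "B.jpg", "primary_categories": "Example_Museum|Example_Archive"},
]
counts = files_per_primary_category(rows)
print(counts["Example_Museum"], counts["Example_Archive"])  # 2 1
```

With this choice the per-category counts can add up to more than the number of files, which is expected whenever files have several primary categories.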

Pageviews by category

Field Description
category The name of the category this row refers to. Coincides with the page title of the category page in Commons. URL version (with underscores).
category_scope Either “shallow” (meaning only media files directly associated with the category were used to aggregate pageviews) or “deep” (meaning all media files within the category and all its recursive subcategories were used to aggregate pageviews).
primary_categories The top ancestor category names of this row’s category. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but because we cannot control that from MediaWiki, we accept multiple primary categories. The list is separated by the bar “|” character.
wiki The canonical name of the visualized wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature at least one media file of the corresponding category will appear here.
page_title The title of the visualized (namespace=0) page. URL version (with underscores). Only (namespace=0) pages featuring at least one media file of the corresponding category will appear here.
pageview_count Aggregated pageview count for (namespace=0) pages featuring at least one media file from the category/scope. Rows with pageview_count=0 should be omitted.
month The month for which we aggregate the data (YYYY-MM).
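
When totalling pageviews per category, it is probably safest to pick a single category_scope: by the definition above, “deep” rows already cover the category's own media files, so adding “shallow” and “deep” rows together would likely count the same pages twice. A minimal sketch, assuming rows are dicts keyed by the field names above with counts still stored as strings (function name and sample values are hypothetical):

```python
from collections import defaultdict


def pageviews_by_category(rows, scope):
    """Sum pageview_count per category for one category_scope only."""
    totals = defaultdict(int)
    for row in rows:
        if row["category_scope"] == scope:
            totals[row["category"]] += int(row["pageview_count"])
    return dict(totals)


# Hypothetical rows from the "Pageviews by category" dataset.
rows = [
    {"category": "Example_Museum", "category_scope": "shallow",
     "wiki": "en.wikipedia", "page_title": "Page_one", "pageview_count": "10"},
    {"category": "Example_Museum", "category_scope": "deep",
     "wiki": "en.wikipedia", "page_title": "Page_one", "pageview_count": "10"},
    {"category": "Example_Museum", "category_scope": "deep",
     "wiki": "fr.wiktionary", "page_title": "Page_deux", "pageview_count": "5"},
]
print(pageviews_by_category(rows, "shallow"))  # {'Example_Museum': 10}
print(pageviews_by_category(rows, "deep"))     # {'Example_Museum': 15}
```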

Pageviews by media file

Field Description
media_file The name of the media file this row refers to. Coincides with the page title of the media file page in Commons. URL version (with underscores).
categories The category names that the media file is directly associated with. The list is separated by the bar “|” character.
primary_categories The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but because we cannot control that from MediaWiki, we accept multiple primary categories. The list is separated by the bar “|” character.
wiki The canonical name of the visualized wiki, e.g. “en.wikipedia” or “fr.wiktionary”. Only wikis that feature the media file at least once will appear here.
page_title The title of the visualized (namespace=0) page. URL version (with underscores). Only (namespace=0) pages featuring the media file will appear here.
pageview_count Aggregated pageview count for (namespace=0) pages featuring the media file. Rows with pageview_count=0 should be omitted.
month The month for which we aggregate the data (YYYY-MM).
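
Each row in this dataset pairs one media file with one page on one wiki, so a common question is which pages bring a given file the most views. The following sketch ranks pages for a single file; it assumes rows are dicts keyed by the field names above, and the function name and sample values are hypothetical.

```python
def top_pages(rows, media_file, limit=3):
    """Return up to `limit` (wiki, page_title, pageviews) tuples, most viewed first."""
    matches = [
        (row["wiki"], row["page_title"], int(row["pageview_count"]))
        for row in rows
        if row["media_file"] == media_file
    ]
    return sorted(matches, key=lambda item: item[2], reverse=True)[:limit]


# Hypothetical rows from the "Pageviews by media file" dataset.
rows = [
    {"media_file": "Example.jpg", "wiki": "fr.wiktionary",
     "page_title": "Page_deux", "pageview_count": "7"},
    {"media_file": "Example.jpg", "wiki": "en.wikipedia",
     "page_title": "Page_one", "pageview_count": "40"},
    {"media_file": "Other.jpg", "wiki": "en.wikipedia",
     "page_title": "Page_three", "pageview_count": "99"},
]
print(top_pages(rows, "Example.jpg", limit=1))  # [('en.wikipedia', 'Page_one', 40)]
```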

Edits

Field Description
user_name The user name of the user who performed the edit. This is resolved from the actor table’s actor_name. If no actor is found, it is set to ‘anonymous’. If it has been suppressed, it is set to ‘redacted’.
edit_type Either “create” (for the first revision of a media file page), or “update” (for all other revisions of the media file page).
media_file The name of the edited media file. Coincides with the page title of the media file page in Commons. URL version (with underscores).
categories The category names that the media file is directly associated with. The list is separated by the bar “|” character.
primary_categories The top ancestor category names of the media file. They should be in the Commons institution category allow-list. Ideally, there would be only one primary category, but because we cannot control that from MediaWiki, we accept multiple primary categories. The list is separated by the bar “|” character.
dt The timestamp of the edit.
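
Since user_name can hold the placeholder values ‘anonymous’ and ‘redacted’, per-user statistics may want to set those rows aside. A minimal sketch, assuming rows are dicts keyed by the field names above; the function name and sample values are hypothetical, and the format of dt is not specified on this page, so it is left unparsed.

```python
from collections import Counter

# Placeholder values described in the user_name field.
PLACEHOLDER_USER_NAMES = {"anonymous", "redacted"}


def edits_per_user(rows):
    """Count edits per (user_name, edit_type), skipping placeholder user names."""
    counts = Counter()
    for row in rows:
        if row["user_name"] in PLACEHOLDER_USER_NAMES:
            continue
        counts[(row["user_name"], row["edit_type"])] += 1
    return counts


# Hypothetical rows from the "Edits" dataset.
rows = [
    {"user_name": "Alice", "edit_type": "create", "media_file": "A.jpg"},
    {"user_name": "Alice", "edit_type": "update", "media_file": "A.jpg"},
    {"user_name": "Alice", "edit_type": "update", "media_file": "B.jpg"},
    {"user_name": "anonymous", "edit_type": "update", "media_file": "A.jpg"},
    {"user_name": "redacted", "edit_type": "create", "media_file": "C.jpg"},
]
counts = edits_per_user(rows)
print(counts[("Alice", "update")])  # 2
```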


All Analytics datasets are available under the Creative Commons CC0 dedication.