Who’s the painter?

July 9, 2019 Maël Fabien Data Enrichment

Better features, better data

In this article, we will be using data from the Web Gallery of Art, a virtual museum and searchable database of European fine arts from the 3rd to 19th centuries. The gallery can be accessed here.

We will create an algorithm to predict the name of the painter from an initial set of features of the painting, then gradually add more and more features, improving the feature engineering and eventually including the pictures themselves.

Through this article, we will illustrate:

  • The importance of good feature engineering
  • The importance of data enrichment
  • The impact this can have on accuracy

Ready? Let’s get started!

The data

To download the data, you can either:

  • click on this link to download the XLS file directly
  • go to the Database tab on the website and click on the last link: You can download the catalog for studying or searching off-line. Select the Excel format (5.2 Mb).

Start off by importing several packages to be used later:
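The original import cell is not shown; a typical stack for this kind of pipeline (the exact set used in the article may differ slightly) would be:

```python
# Core packages used throughout the article
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
```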


The architecture of our folders should be the following:
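A minimal layout could look like this (the notebook name is illustrative; images is the empty folder mentioned below):

```
project/
├── catalog.xlsx      # the downloaded catalog
├── notebook.ipynb    # our working notebook
└── images/           # empty for now, filled later
```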


Images is an empty folder to be used later.

Import the file

catalog.xlsx:
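A minimal load step might look like this (assuming the file sits next to the notebook; the guard simply avoids a crash if it is missing):

```python
import os
import pandas as pd

# Read the catalog if present; the WGA export is a single-sheet Excel file
df = pd.read_excel('catalog.xlsx') if os.path.exists('catalog.xlsx') else pd.DataFrame()
print(df.shape)
print(list(df.columns))
```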


We directly notice that we need to process the data to make it exploitable. The available columns are:

  • The author, which we will try to predict
  • The dates of birth and death of the author. We will drop these since they are directly linked to the author
  • The title of the painting
  • The date of the painting, if available
  • The technique used (Oil on copper, oil on canvas, wood, etc) as well as the size of the painting
  • The current location of the painting
  • The URL of the image on the website
  • The form (painting, ceramics, sculpture, etc). We will only focus on paintings in our case.
  • The type of painting (mythological, genre, portrait, landscape, etc.)
  • The school, i.e. the dominant painting style
  • The time frame in which experts estimate that the painting was painted

Feature engineering

Since the dataset itself is not made for running an ML algorithm, but is meant to be a simple catalog, we need some processing.


By exploring the data, we notice missing values for the date. When the date is approximate, it is denoted by:

  • 1590s
  • or c.1590

Moreover, the missing values are denoted by a hyphen. For all these reasons, using a regex to extract the date seems to be appropriate.
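A minimal version of that regex extraction could look like this (assuming date strings of the forms listed above):

```python
import re
import numpy as np

def extract_year(raw):
    """Pull a 3- or 4-digit year out of strings like '1590s' or 'c.1590'."""
    match = re.search(r'\d{3,4}', str(raw))
    return int(match.group(0)) if match else np.nan

print(extract_year('1590s'))   # 1590
print(extract_year('c.1590'))  # 1590
print(extract_year('-'))       # nan
```

Applied with `df['DATE'].apply(extract_year)` (the column name is assumed), this yields a clean numeric date column.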


The time frame is redundant if the date is known. Including both variables would introduce multicollinearity in the data.



The “Technique” is an interesting feature. It is a string that takes the following form:

Oil on copper, 56 x 47 cm

We can extract several elements from this feature:

  • The type of painting (oil on copper)
  • The height
  • The width

We will only focus on paintings, and drop observations that are sculptures or architecture for example.
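Assuming a form column with values like "painting" or "sculpture" (the column name is taken from the catalog description above), the filter is a one-liner; here is a sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the catalog (FORM is the column described above)
df = pd.DataFrame({'FORM': ['painting', 'sculpture', 'painting', 'architecture'],
                   'AUTHOR': ['A', 'B', 'C', 'D']})

# Keep only the paintings
df = df[df['FORM'] == 'painting'].reset_index(drop=True)
print(len(df))  # 2
```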


We can apply several functions to extract the width and height:
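The extraction can be sketched as follows (a minimal version, assuming the "56 x 47 cm" pattern shown above, with the first number read as the height):

```python
import re
import numpy as np

DIM_RE = re.compile(r'(\d+(?:[.,]\d+)?)\s*x\s*(\d+(?:[.,]\d+)?)')

def extract_height(technique):
    """First number of the '<h> x <w> cm' pair, if present."""
    m = DIM_RE.search(str(technique))
    return float(m.group(1).replace(',', '.')) if m else np.nan

def extract_width(technique):
    """Second number of the '<h> x <w> cm' pair, if present."""
    m = DIM_RE.search(str(technique))
    return float(m.group(2).replace(',', '.')) if m else np.nan

print(extract_height('Oil on copper, 56 x 47 cm'))  # 56.0
print(extract_width('Oil on copper, 56 x 47 cm'))   # 47.0
```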


Width and height

In some cases, the “Technique” feature does not contain the width or the height. We might want to fill the missing values. It’s not a good idea to fill them with 0’s. To minimize the error, we’ll set the missing values to the average of each feature.
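Mean imputation is straightforward with pandas; a sketch on toy columns (the column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy columns with missing dimensions
df = pd.DataFrame({'height': [56.0, np.nan, 120.0],
                   'width':  [47.0, 80.0,  np.nan]})

# Replace missing values by the column mean rather than 0
for col in ['height', 'width']:
    df[col] = df[col].fillna(df[col].mean())

print(df)
```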

Missing values and useless columns

As stated above, we won’t exploit the birth or death dates of the author, since this information depends directly on the author.

At this point we can confidently drop any row that has missing values since the processing is almost over.

There are many authors in the database (> 3500). To check this, simply run a value count on the author feature.

We will need a good number of training samples per label for the algorithm to work. For this reason, all authors with fewer than 200 observations should be dropped. This is a major limitation of our simple model, but it will give a better class balance later on.
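Both steps can be sketched like this (shown on a toy frame with a threshold of 2 instead of 200):

```python
import pandas as pd

# Toy frame: author 'A' has 3 works, 'B' only 1
df = pd.DataFrame({'AUTHOR': ['A', 'A', 'A', 'B'],
                   'year':   [1590, 1600, None, 1610]})

# Drop the remaining rows with missing values
df = df.dropna()

# Keep only authors with enough observations (200 in the article; 2 here)
counts = df['AUTHOR'].value_counts()
df = df[df['AUTHOR'].isin(counts[counts >= 2].index)]
print(len(df))  # 2
```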


A first model

The aim of this exercise is to illustrate the need for good feature engineering and additional data. We won’t spend too much time on the optimization of the model itself, and we will use a random forest classifier. A label encoding needs to be applied to transform the labels into numeric values that can be understood by our model.

The accuracy of a model will be evaluated as the average over a 5-fold cross-validation.
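The whole first model fits in a few lines; here is a sketch on synthetic data standing in for the engineered features (the painter names are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder

rng = np.random.RandomState(0)

# Synthetic stand-ins for the engineered features and the painter names
X = rng.rand(300, 5)
authors = rng.choice(['GIOTTO', 'RAPHAEL', 'RUBENS'], size=300)

# Encode the painter names as integers
y = LabelEncoder().fit_transform(authors)

# Average accuracy over a 5-fold cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```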

The mean accuracy during our cross validation reaches 81.1% with our simple random forest model. We can also look at the confusion matrix.

It’s easy to understand that mistakes are made more frequently for the least-represented authors, given that we have fewer observations for them.

More feature engineering


Alright, we are now ready to move on and add other variables by improving the feature engineering. Looking at the “Technique” feature, you will notice that we have not used the “type of painting” variable yet. Indeed, only the width and height have been extracted from this field.

The technique is systematically specified before the first comma. We will split the string on the first comma, if there is one, and then select the first word (oil, tempera, wood…).
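That split can be sketched as a small helper:

```python
def extract_technique(technique):
    """Text before the first comma, reduced to its first word, lowercased."""
    return str(technique).split(',')[0].split(' ')[0].lower()

print(extract_technique('Oil on copper, 56 x 47 cm'))  # oil
print(extract_technique('Wood'))                       # wood
```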


So far we have not exploited the location field either. The location describes where the painting is being kept. We only extract the name of the city from this field, as extracting the name of the museum would lead to overfitting: the collections of each museum are limited, and we only have around 4’500 training samples at this point.
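Assuming location strings of the form "museum, city" (the exact catalog format may vary), the city extraction is a simple split:

```python
def extract_city(location):
    """Take the part after the last comma, e.g. 'Museo del Prado, Madrid' -> 'Madrid'."""
    return str(location).split(',')[-1].strip()

print(extract_city('Museo del Prado, Madrid'))  # Madrid
```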

Second model

After adding these two variables, we can evaluate the outcome again with cross-validation.

Then, run the cross validation:

And print the confusion matrix:
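Both steps can be sketched as follows, again on synthetic stand-ins for the enriched feature matrix (out-of-fold predictions give an honest confusion matrix):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.RandomState(1)
X = rng.rand(200, 7)        # features incl. the encoded technique and city
y = rng.randint(0, 3, 200)  # encoded painter labels

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())

# Predictions made on held-out folds, then compared to the true labels
y_pred = cross_val_predict(clf, X, y, cv=5)
print(confusion_matrix(y, y_pred))
```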

We have gained significant accuracy by improving the feature engineering!

Process the title

Can the processing of the title bring additional accuracy? It might be interesting to:

  • Embed the title using a pre-trained model
  • Reduce the dimension of the embedding using a Principal Component Analysis (PCA)
  • Use the new dimensions as new features to predict the name of the painter

To start, download the pre-trained model from spaCy from your terminal:

python -m spacy download en_core_web_md

We will be using a pre-trained Word2Vec model and begin by defining the embedding function:

We then apply our function to the list of titles:
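With spaCy, `nlp(title).vector` already returns the average of the 300-d word vectors of the title. The averaging logic itself can be sketched with a toy lookup table standing in for the pre-trained vocabulary (in the real pipeline, `lookup[word]` would be `nlp.vocab[word].vector` with dim=300):

```python
import numpy as np

def average_embedding(title, lookup, dim=3):
    """Average the word vectors of a title; zeros if no word is known."""
    vectors = [lookup[w] for w in title.lower().split() if w in lookup]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Toy 3-d "word vectors" standing in for the pre-trained model
toy = {'christ':   np.array([1.0, 0.0, 0.0]),
       'blessing': np.array([0.0, 1.0, 0.0])}
print(average_embedding('Christ Blessing the Children', toy))  # [0.5 0.5 0. ]
```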

We will now reduce the dimension of the embedding (currently 300) to use it as features in our prediction. Principal Component Analysis (PCA) is sensitive to scale, so the embedding values must be standardized first:

We can apply the PCA on the rescaled data and see what percentage of the variance we are able to explain:
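The scaling and the explained-variance check can be sketched like this, with random data standing in for the 300-d title embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
embeddings = rng.rand(500, 300)  # stand-in for the title embeddings

# PCA is sensitive to scale, so standardize first
scaled = StandardScaler().fit_transform(embeddings)

# Cumulative explained variance over the first 200 components
pca = PCA(n_components=200).fit(scaled)
print(np.cumsum(pca.explained_variance_ratio_)[-1])
```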

This is a tricky situation. Adding more dimensions seems to smoothly improve the percentage of the explained variance, up to 200 features. This might happen if the embeddings are too similar since the Word2Vec model has been trained on a corpus that uses a more general vocabulary, e.g. “Scenes from the Life of Christ” and “Christ Blessing the Children” will tend to have similar average embeddings.

To confirm this thought, we can plot the embeddings, reduced to 2 dimensions by PCA, on a scatterplot.

There seems to be no real clustering effect, although a K-Means algorithm could probably detach 3-4 clusters.

We might expect the new features derived from the embedding not to improve the overall accuracy.

This is indeed the case. Should we then include the title variable? A convenient feature of random forests is that they expose feature importances: by checking them, we can see how much the node splits rely on the values of each feature.
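Inspecting the importances is a one-liner on a fitted forest; a sketch on synthetic data (the feature names are placeholders, and the label is built to depend on the first feature so that its importance dominates):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
# Make the label depend mostly on the first feature
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in zip(['date', 'height', 'width', 'pca_title'],
                     clf.feature_importances_):
    print(name, round(imp, 3))
```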

The 2 features extracted by the PCA of the embedding are the most important. Including them at this point might not be a good idea, as we would need to fine-tune the Word2Vec embedding for our use case. A similar approach with a PCA on TF-IDF features has been tested and gave similar results.

This highlights a major limitation of the dataset itself. This open-source catalog focuses on European art between the 3rd and the 19th century and mainly includes religious art. Therefore, the titles, the pictures and certain characteristics are quite similar across artists. Pre-trained models require fine-tuning, and feature engineering needs to be done wisely.

Exploiting the images

The URL column contains a link to download the images. By clicking on a link, we access the webpage of the painting.

If you click on the image, you can notice how the URL changes. We now have direct access to the image:

In this example, the URL simply changes from the painting’s HTML page to the image file itself. All we need to do is process the URLs so they fit the second template.
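A sketch of that rewrite, assuming the page-to-image pattern observed on the site (replace `/html/` with `/art/` and the `.html` extension with `.jpg`; verify against the actual URLs in your catalog, and note the path below is illustrative):

```python
def to_image_url(page_url):
    """Turn a painting page URL into the direct image URL.
    Assumed pattern: '/html/' -> '/art/' and '.html' -> '.jpg'."""
    return page_url.replace('/html/', '/art/').replace('.html', '.jpg')

print(to_image_url('https://www.wga.hu/html/g/giotto/example.html'))
```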

We are now ready to download all the images. First, create an empty folder called images and enter the following script to fetch images from the website directly:
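A minimal version of that download loop could look like this (the `download_images` helper and the numbering scheme are illustrative; it skips broken links rather than crashing):

```python
import os
import time
import urllib.request

def download_images(urls, folder='images'):
    """Fetch each image URL into the images/ folder, pausing between requests."""
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(urls):
        try:
            urllib.request.urlretrieve(url, os.path.join(folder, '%d.jpg' % i))
        except Exception as exc:  # skip broken links rather than crash
            print('failed:', url, exc)
        time.sleep(1)             # be gentle with the server

# download_images(df['URL'])  # uncomment to run on the real catalog
```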

Depending on your connection and the server response time, it might take several minutes to hours to download the 4488 images. It is a good idea to add a time.sleep(1) within the for loop to avoid errors. At this point, we are faced with the problem that each image has a different size and resolution. We need to scale down the images and add margins in order to make them all square.

To further reduce the dimension we only use the greyscale version of the images:

Run this script to reduce the dimensions of the images to 100 × 100 and add margins where needed. We are using OpenCV’s resize function in the loop:

The images have been reduced to a dimension of 100×100, but that’s still 10’000 features to potentially include in the original dataset, and including them pixel by pixel won’t make much sense. PCA finds the eigenvectors of the covariance matrix with the highest eigenvalues; those eigenvectors are then used to project the data into a lower dimension. PCA is commonly used for feature extraction.

Many techniques of computer vision could be applied here but we will simply apply a PCA on the image itself.

The number of components to extract has been tested empirically, and 1 component gave additional accuracy:
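The image-PCA step can be sketched like this, with random data standing in for the flattened 100×100 greyscale images:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Stand-in for the greyscale images: 50 flattened 100x100 pictures
images = rng.rand(50, 100 * 100)

# Keep a single principal component per image, as tested in the article
pca = PCA(n_components=1)
image_feature = pca.fit_transform(images)
print(image_feature.shape)  # (50, 1)
```

The resulting column is then concatenated to the engineered features before re-running the cross-validation.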

Half a percentage point of accuracy is gained by adding the PCA of the image as a feature.


We can summarize by saying that this article shows how good feature engineering and external data sources can improve the accuracy of a given model.

  1. Simple feature engineering: 0.81117
  2. Improved feature engineering: 0.84084
  3. Add embedding of the title: 0.83197
  4. Add PCA of the images: 0.84416

We improved the accuracy by up to 3.3%. There is still room for better models, deep learning pipelines, computer vision techniques and fine-tuned embeddings.
