  • Anastasiia Navalikhina

A dive into blood-brain barrier permeability prediction models (part 2)

In our first article, we saw that the blood-brain barrier (BBB) has limited permeability for most chemicals and that getting a drug inside the brain, a key step in the drug development process, can be tricky. Some drugs pass through this barrier passively by diffusion, while others can only be delivered by specialized proteins named influx transporters.

In this second article, we are going to get our data ready for the Machine Learning models that will predict substrates of two major influx transporters that pump drugs into the brain: OATP1A2 and OATP2B1.

No data science project can start efficiently without first conceptualizing what we want to achieve (basically, the output we want to predict) and how we are going to achieve it (the features we are going to predict from). Let us therefore walk through the thought process that led us to choose our output and our inputs.

Clarifying our objectives: the output

Remember the big picture: in the initial stages of drug development, predicting the permeability of the BBB with Machine Learning (ML) methods can be useful for screening compound libraries. Drugs with the required activity can then be selected for further tests.

Sometimes, the permeability of the BBB for certain drugs needs to be altered. This is where transporter modulators come in. For example, when the only way to cure a particular condition is to use a drug that is a BBB transporter substrate, a possible way to overcome the associated side effects is to co-administer an inhibitor of this transporter. The inhibitor blocks the transporter and restricts the main drug's access to the brain, avoiding those adverse effects.

Another common action of modulators is to prevent transporters from performing efflux of the drug from the target organ. This could be useful, for instance, in cancer treatment, where anti-cancer drugs need to accumulate in the target organ in order to be effective. In this case, inhibitors can be used to block the efflux transporters that pump anti-cancer drugs out of the organ.

Therefore, transporter modulators can potentially be used to rationalize drug distribution in the body. On the other hand, transporter modulator activity can also be undesirable in drug candidates, and there are numerous examples of adverse effects associated with BBB transporter inhibitors.

As a result, predicting this property at the initial stages of drug development can help avoid issues that would otherwise emerge in the later pre-clinical and clinical trial stages. We will therefore focus on predicting substrates of OATP1A2 and OATP2B1, two of the main BBB influx transporters.

Fueling the model: the inputs

Now for the inputs: a drug's ability to pass through a channel or to block it is associated with its chemical structure.

Chemical structure can be described with a set of properties such as the number of certain atoms and chemical bonds, the molecule's solubility in water and organic solvents, or its van der Waals surface area. Such properties, named chemical descriptors, compose the fingerprint of a given molecule. These molecular fingerprints can be used in ML models to predict whether a molecule will have a certain activity (for example, be a substrate or an inhibitor of a certain transporter).

The stage is now set: we will build ML models predicting substrates of OATP1A2 and OATP2B1 based on features extracted from the chemical structure of drugs. We will use an open database to get a list of inhibitors/non-inhibitors and substrates/non-substrates, describe these chemicals with molecular descriptors, and use them to train and evaluate our models.

Data preparation

All data used to build the models are obtained from the ChEMBL open database, from which we get 276 chemicals. After removing the entries that have no SMILES representation (we will see what SMILES is later on) or no proper label, we are left with 260 chemicals, which are either substrates or inhibitors of OATP1A2 and OATP2B1, or neither.
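The cleaning step described above can be sketched as a simple filter. This is only an illustration: the record layout, the field names `smiles` and `label`, and the toy entries are assumptions, not the actual ChEMBL export used for the article.

```python
# Toy records with a hypothetical layout; these are not real ChEMBL entries.
records = [
    {"name": "compound_a", "smiles": "CCO", "label": 0},
    {"name": "compound_b", "smiles": None, "label": 1},       # no SMILES
    {"name": "compound_c", "smiles": "c1ccccc1", "label": None},  # no label
    {"name": "compound_d", "smiles": "CC(=O)O", "label": 1},
]


def has_smiles_and_label(record):
    """Keep a record only if it has a non-empty SMILES string and a label."""
    return bool(record.get("smiles")) and record.get("label") is not None


clean = [r for r in records if has_smiles_and_label(r)]
print(len(records), "->", len(clean))
```

Only the two fully described toy compounds survive the filter, in the same way that the article's data set shrinks from 276 to 260 entries.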

On the one hand, compounds with more than 50% inhibition potency are classified as inhibitors. On the other hand, compounds with less than 50% inhibition potency, as well as transporter substrates, are classified as non-inhibitors. With this process, we generate a set of 50 samples (14 inhibitors and 36 non-inhibitors) for OATP1A2 and 257 samples (53 inhibitors and 204 non-inhibitors) for OATP2B1.

For the substrate dataset, the 11 compounds that are well-known substrates are classified accordingly, whereas all inhibitors are classified as non-substrates (as inhibitors cannot be substrates of a channel). This gives a total set of 252 samples (11 substrates and 241 non-substrates).
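The labeling rule for the inhibitor data sets can be written as a small function. This is a minimal sketch, not the code used for the article: the function name and arguments are hypothetical, and because the text does not say how a compound at exactly 50% is handled, the sketch assumes it falls in the non-inhibitor class.

```python
def label_inhibitor(inhibition_percent, is_substrate=False, threshold=50.0):
    """Return 1 for an inhibitor, 0 for a non-inhibitor, None if unlabeled.

    Assumption: a value exactly equal to the threshold counts as non-inhibitor.
    """
    if is_substrate:
        # Transporter substrates are placed in the non-inhibitor class.
        return 0
    if inhibition_percent is None:
        # No measurement available: the compound cannot be labeled.
        return None
    return 1 if inhibition_percent > threshold else 0


print(label_inhibitor(72.0))                     # strong inhibition
print(label_inhibitor(18.5))                     # weak inhibition
print(label_inhibitor(None, is_substrate=True))  # known substrate
```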

Fig. 1 Proportion of substrates, inhibitors of OATP1A2 (A2), inhibitors of OATP2B1 (B1), and non-substrates / non-inhibitors ("zero class") in the dataset samples

Mordred featurization

The drugs in our data sets are now featurized with Mordred descriptors.

This process takes a molecule's SMILES as input. SMILES is simply a single-line string representation of a molecule (Fig. 2). From the SMILES, Mordred generates a molecular fingerprint consisting of 1613 features (the list of Mordred features and their descriptions can be found in the Mordred documentation).

Fig. 2 Dexamethasone, an inhibitor of OATP1A2: (a) structure and (b) SMILES representation.

Rebalancing classes

We now face a situation that is pretty common in health-applied data science: all three data sets used here are imbalanced, meaning that they have more instances of class zero (non-substrates and non-inhibitors) than of class one (substrates and inhibitors), the class we want to predict.

To build a model able to efficiently predict instances from class one, we need to rebalance the data set.

Rebalancing can be done in two ways: by undersampling the majority class or by oversampling the minority class. Oversampling can in turn be achieved by copying instances of a class or by creating new instances from existing ones. We take the latter route and create synthetic samples with the SMOTE technique (Chawla 2002).

The result of oversampling the substrate subset can be seen in Fig. 3. Here, we create 37 synthetic instances for the minority “substrate” class, which had only 11 samples before the operation. As a result, we get 48 substrate samples, and the ratio of substrates to non-substrates in the data set changes from 0.05 to 0.2.
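To make the idea behind SMOTE concrete, here is a simplified pure-Python sketch. It is not the implementation used for the article: the function name `smote_sketch`, the choice of k = 5 neighbours, and the random 2-D toy points are all assumptions. Each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Interpolate between minority samples and their k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest neighbours of `base` inside the minority class (itself excluded)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy 2-D minority class using the article's counts: 11 real, 37 synthetic.
toy_rng = random.Random(42)
minority = [[toy_rng.random(), toy_rng.random()] for _ in range(11)]
synthetic = smote_sketch(minority, n_synthetic=37)

n_majority = 241  # non-substrates in the substrate data set
ratio_before = len(minority) / n_majority
ratio_after = (len(minority) + len(synthetic)) / n_majority
print(round(ratio_before, 2), round(ratio_after, 2))
```

Because every synthetic point lies between two existing minority samples, the new instances stay inside the region already occupied by the minority class instead of duplicating points exactly.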

Fig. 3 PCA-transformed data set of compounds classified as OATP1A2 and OATP2B1 substrates (1) and non-substrates (0) before SMOTE (left) and after SMOTE (right).

There is an additional approach that can be applied to imbalanced data sets: modifying the model parameters rather than the data set itself. The main objective of our model (a classifier) is to correctly predict the true “class one” chemicals, that is, the substrates and inhibitors of the transporters. This means that we want a high recall score for this class (a high proportion of correctly predicted positives among all actual positives in our data set).
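As a reminder of what these scores measure, the following sketch computes precision and recall for "class one" on a made-up set of ten predictions. The labels are purely illustrative and the helper name `precision_recall` is an assumption.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the `positive` class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative labels: 3 true positives, 7 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# The classifier finds every positive but also flags two negatives.
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case no true positive is missed (recall of 1.0), at the price of two false positives that lower precision to 0.6.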

Since a model's output is a trade-off between precision and recall, we can sacrifice some "class one" precision and accept more "class one" false positives than we normally would. These false positives will be eliminated in further tests during the drug development process. This way, we will not miss drugs that can cause adverse effects because they are true substrates or inhibitors of the BBB transporters. So how do we implement it? We can increase the cost of misclassification with class weights in the Random Forest and Logistic Regression models.

Great! Our data is now ready, so let us predict. In our next and final article, we are going to build and fit our Machine Learning models and evaluate the results together. Stay tuned!
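One common way to choose such weights is to make them inversely proportional to the class frequencies. The sketch below applies this heuristic to the substrate data set (11 substrates, 241 non-substrates). The helper name is hypothetical and the article does not state which weights were actually used. The formula mirrors the "balanced" heuristic found in scikit-learn, whose Random Forest and Logistic Regression classifiers accept a `class_weight` parameter, so a dictionary like this one could be passed there.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class counts of the substrate data set described in the article.
labels = [1] * 11 + [0] * 241
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in weights.items()})
```

With these weights, an error on a substrate costs about 22 times more than an error on a non-substrate (241 / 11), which pushes the model toward higher recall on class one.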



  1. N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

