# A recommender for species interactions

## What are we having for dinner?

Ecological networks are not that complicated. To illustrate this, we will see
how easily we can use *Julia* to build a recommender system to suggest suitable
preys to predators, using the k nearest neighbors algorithm. This post is
largely based on @DesjLaig17, which I suggest you take a brief look at (it’s
free as in we paid the APC).

The data we have is a list of preys eaten by various predators, which we can get
from *GitHub* and store locally:

```
# Prepare a folder to store the data
ispath("data") || mkdir("data")
data_folder = joinpath("data", "knn-interactions")
ispath(data_folder) || mkdir(data_folder)
# Download the data
_project_root = "https://raw.githubusercontent.com/PhDP/EcoInter/master/KNN/data/"
int_file = joinpath(data_folder, "interactions.csv")
isfile(int_file) || download(_project_root*"mercure-interactions.csv", int_file)
```

Every line is a comma separated list of preys, so we can simply turn it into a set, after filtering out the lines with no preys on them:

```
preys = Set.(split.(filter(l -> length(l)>0, readlines(int_file)), ", "));
```

Let’s check the diet of the first species:

```
first(preys)
```

```
Set(SubString{String}["405", "432", "392", "391"])
```

In order to recommend suitable preys, we need to have a measure of distance. We will use one based on the Tanimoto similarity, which is

$$ \text{Tanimoto}(x,y) = \frac{|x\cap y|}{|x\cup y|} $$

Note how close the function is to the mathematical notation, thanks to the unicode goodness:

```
tanimoto(x::Set{T}, y::Set{T}) where {T} = length(x∩y)/length(x∪y)
```

```
tanimoto (generic function with 1 method)
```

As this is a similarity measure, we will turn it into a distance by substracting it from 1:

```
D(x::Set{T}, y::Set{T}) where {T} = 1.0 - tanimoto(x,y)
```

```
D (generic function with 1 method)
```

We now have all we need to perform our analysis! Specifically, we want to
suggest suitable preys to an organism, based on what organisms with similar
diets are eating. To do this, we will define “similar” as being the *k* closest
neighbors, and then we will recommend preys based on how common they are in the
diet of these *k* neighbors. If we need a small number of neighbors to get a
good result, then networks are not actually that complicated. Let’s start.

First, because we will use random removal of a species to test our approach, we need to limit ourselves to species with more than 2 preys:

```
filter!(d -> length(d)>2, preys);
```

For each of the remaining species, we want to pick one of its preys, remove it,
and then look for the closest neighbors absed on the incomplete diet. We can do
this using `sample`

, so let’s load `StatsBase`

:

```
import StatsBase
```

Let’s define a function to remove a species from the diet. We could do this by
adding a method to `StatsBase.sample`

, which would be:

```
StatsBase.sample(s::T, i::Integer; k...) where {T <: Set} = Set(sample(collect(s), i; k...))
```

But in tour case, we want to keep track of the identity of the species that was removed, so as to see if our recommendation makes sense, and so we will use something slightly more (but barely) complex:

```
function leave_one_prey_out(x::T) where {T <: Set}
kept_preys = StatsBase.sample(collect(x), (length(x)-1); replace=false)
lost_prey = filter(x -> !(x in kept_preys), collect(x))
return Set(kept_preys), first(lost_prey)
end
leave_one_prey_out(preys[1])
```

```
(Set(SubString{String}["405", "392", "391"]), "432")
```

We can now get the distances between all *other* predators and the remaining
preys in the diet:

```
kept, lost = leave_one_prey_out(preys[1])
distances = [D(kept, p) for p in preys[2:end]];
```

The `distances`

array has the distance between the incomplete diet, and the
other (complete) ones. We now need to get the positions that correspond to the
closest *k* neighbors; let’s assume we use $k=5$:

```
k = 5
closest = partialsortperm(distances, 1:k)
```

```
5-element view(::Array{Int64,1}, 1:5) with eltype Int64:
1
6
8
9
12
```

At this point, were we doing a real research project, we could shuffle diets to avoid re-selecting the earliest ones, etc, but this will do for now. The closest neighbors diets are:

```
closest_diets = preys[2:end][closest]
```

```
5-element Array{Set{SubString{String}},1}:
Set(["405", "432", "392", "391"])
Set(["405", "432", "392", "391"])
Set(["405", "432", "392", "391"])
Set(["405", "432", "392", "391"])
Set(["405", "432", "392", "391"])
```

So, this is easy! Our species of interest has 5 closest neighbors with the same diet. To see which one we would recommend, let’s still write some code to count their frequencies:

```
recommendations = StatsBase.countmap(vcat(collect.(closest_diets)...))
```

```
Dict{SubString{String},Int64} with 4 entries:
"405" => 5
"432" => 5
"392" => 5
"391" => 5
```

At this point, a common question is *How often would the correct species be
recommended in the top $n$ choices*? So let’s be crafty:

```
n = 5
picks = zeros(Bool, n)
for rec_number in 1:n
if length(recommendations) > 0
# We pick the group with the most votes
cmax = filter(f -> f.second == maximum(values(recommendations)), recommendations)
# Then we check if the species we removed is in it
picks[rec_number] = lost in collect(keys(cmax))
# Finally we remove all preys with the most vote
filter!(f -> f.second < maximum(values(recommendations)), recommendations)
end
end
```

If the correct prey was found, then the `picks`

array will have a non-false
value somewhere, so we can say that a success in prediction is when:

```
any(picks)
```

```
true
```

Let’s now wrap this up into a function, for the entire dataset:

```
function dinnertime(preys; k::Int64=3, n::Int64=5)
@assert 1 ≤ k ≤ (length(preys)-1)
@assert 1 ≤ n ≤ 20 # This is arbitrary
success = zeros(Bool, length(preys))
for (i,diet) in enumerate(preys)
kept, lost = leave_one_prey_out(diet)
other_preys = preys[filter(idx -> idx ≠ i, 1:length(preys))]
distances = [D(kept, p) for p in other_preys]
closest = partialsortperm(distances, 1:k)
closest_diets = other_preys[closest]
recommendations = StatsBase.countmap(vcat(collect.(closest_diets)...))
picks = zeros(Bool, n)
for rec_number in 1:n
if length(recommendations) > 0
# We pick the group with the most votes
cmax = filter(f -> f.second == maximum(values(recommendations)), recommendations)
# Then we check if the species we removed is in it
picks[rec_number] = lost in collect(keys(cmax))
# Finally we remove all preys with the most vote
filter!(f -> f.second < maximum(values(recommendations)), recommendations)
end
end
success[i] = any(picks)
end
return sum(success)/length(success)
end
```

```
dinnertime (generic function with 1 method)
```

Let’s now try this, with the default of $k=3$ and the top 5 votes:

```
dinnertime(preys)
```

```
0.8688046647230321
```

**And that’s it**! There are a bunch of things we would want to fix to actually
*use this - yet, with only three neighbors, we can recommend the correct preys
*amongst the first five picks about 85% of the time. As we discuss in
*@DesjLaig17, things get even better when we use some information on traits, but
*there is an impressive amount of information stored in the networks themselves.
*And extracting it takes about a dozen lines of code.