Database manipulation with pandas¶

You can use any database library to manipulate data in the ARIEL database. In this example we will be using pandas.

[2]:

# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import sqlite3
import numpy as np

1. How to get data about the experiment¶

As explained above, ARIEL uses an SQL database to store the data for the individuals. This means that we have a complete file of every individual that existed during the evolution process whether they died or continued to live until the final generation.

1.1 Database to pandas dataframe¶

All data about the individuals is stored in the individual table. The following line shows you how to turn the database into a pandas DataFrame, which then allows you to manipulate and use the data however you want.

[3]:

data = pd.read_sql("SELECT * FROM individual", sqlite3.connect("__data__/database.db"))

[4]:

data.head(5)

[4]:

	id	time_of_death	fitness_	genotype_	tags_
0	1	7	21.507857	[54.5476194721131, 18.787697613687754, 87.9664...	{}
1	2	4	21.764198	[-23.51735411458668, 83.29895497528028, -50.73...	{}
2	3	2	21.364822	[-41.8642899864318, -26.625290981240703, 21.46...	{}
3	4	4	21.965413	[-2.399348473982663, 32.83936881495799, 49.437...	{}
4	5	1	21.852740	[40.97966824037182, -45.6694431228661, -49.292...	{}

2. Get population per generation¶

The way ARIEL keeps track of individuals is not in the form of a list for each generation like it is more usually used in EAs. Normally in EAs, the population is represented as a list holding the “current alive population”. This means that if you want more detailed data on individuals that did not survive to the end of the evolution loop you do not have that data.

Since we have kept track of all individuals to ever exist in the experiment in the database, we can pull that data and use it to get statistics about the experiment. But, since we don’t have the list of individuals per generation, we need to make it.

We can do that using the time_of_birth and time_of_death values for each individual.

[5]:

# Get minimum and maximum generation number
min_gen = int(data['time_of_birth'].min())
max_gen = int(data['time_of_death'].max())

# Get all individuals that are alive in each generation
population_per_gen = {
    gen: data.loc[(data['time_of_birth'] <= gen) & (data['time_of_death'] > gen), 'id'].tolist()
    for gen in range(min_gen, max_gen + 1)
}

# Structure dataframe for easier viewing
pop_df = pd.DataFrame({
    'generation': list(population_per_gen.keys()),
    'individuals': list(population_per_gen.values()),
    'pop size': [len(v) for v in population_per_gen.values()]
    })

pop_df.head(4)

[5]:

	generation	individuals	pop size
0	0	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	100
1	1	[1, 2, 3, 4, 7, 8, 9, 10, 12, 15, 20, 21, 22, ...	100
2	2	[1, 2, 4, 7, 9, 10, 15, 21, 22, 25, 26, 32, 34...	100
3	3	[1, 2, 4, 9, 10, 15, 21, 22, 25, 26, 32, 35, 3...	100

2. Computing fitness statistics per generation¶

Now that we know which individuals were alive in each generation, we can compute:

Mean fitness
Standard deviation
Best (minimum) fitness

[6]:

# Get the fitness of every individual
fitness_by_id = data.set_index('id')['fitness_']

# Initialise lists
means = []
stds = []
maxs = []

# Get data from dataframe
for gen in pop_df['generation']:
    ids = population_per_gen.get(int(gen), [])
    if not ids:
        means.append(np.nan)
        stds.append(np.nan)
        maxs.append(np.nan)
        continue

    fits = fitness_by_id.reindex(ids).dropna().astype(float).values

    # If there is no data, initialise it as nan to avoid erors
    if fits.size == 0:
        means.append(np.nan)
        stds.append(np.nan)
        maxs.append(np.nan)

    # If there is data, add it to the list
    else:
        means.append(float(np.mean(fits)))
        stds.append(float(np.std(fits, ddof=0)))
        maxs.append(float(np.min(fits)))

# Add data to separate columns
pop_df['fitness_mean'] = means
pop_df['fitness_std'] = stds
pop_df['fitness_best'] = maxs

# Show dataframe
pop_df.head(4)

[6]:

	generation	individuals	pop size	fitness_mean	fitness_std	fitness_best
0	0	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	100	21.666660	0.273939	20.806873
1	1	[1, 2, 3, 4, 7, 8, 9, 10, 12, 15, 20, 21, 22, ...	100	21.399773	0.664911	19.846392
2	2	[1, 2, 4, 7, 9, 10, 15, 21, 22, 25, 26, 32, 34...	100	21.261569	0.748077	19.952315
3	3	[1, 2, 4, 9, 10, 15, 21, 22, 25, 26, 32, 35, 3...	100	21.159823	0.801033	19.862954

3. Plotting¶

Now that we have the mean and best fitness and standard deviation per generation, we can use them to plot the fitness progression

[7]:

# Create a copy of dataframe
df = pop_df.copy()

# Create mask to avoid missing data
# if the evolution process was done correctly this should not change anything.
# This mask could be used to get custom data out of the dataframe
mask = df['fitness_mean'].notna()

# Get data according to the mask
x = df.loc[mask, 'generation']
mean = df.loc[mask, 'fitness_mean']
std = df.loc[mask, 'fitness_std']
maxv = df.loc[mask, 'fitness_best']

# Generate line plot of fitness per generation
plt.figure(figsize=(10, 5))
plt.plot(x, mean, label='Fitness mean', color='C0', linewidth=2)
plt.plot(x, maxv, label='Fitness best', color='C1', linestyle='--', linewidth=2)
plt.fill_between(x, mean - std, mean + std, color='C0', alpha=0.25, label='Mean ± Std')
plt.xlabel('Generation')
plt.ylabel('Fitness')
plt.title('Fitness statistics per generation')
plt.legend()
plt.grid(alpha=0.3)
plt.yticks(range(0, int(max(df["fitness_mean"]) + 5), 2))
plt.tight_layout()
plt.show()

../../_images/source_Db_examples_db_with_pandas_12_0.png