pm21-dragon/exercises/release/exercise-07/2__pandas_intro.ipynb
2024-11-25 08:20:05 +01:00

164 KiB

<html> <head> </head>
In [ ]:
# You must run this cell, but you can ignore its contents.

import hashlib

def ads_hash(ty):
    """Return a short hash string (first 10 hex digits of SHA-256) for input"""
    ty_str = str(ty).encode()
    m = hashlib.sha256()
    m.update(ty_str)
    return m.hexdigest()[:10]
In [ ]:
import pandas as pd

Let's get started by making a sample dataframe with fake data:

In [ ]:
sample_df = pd.DataFrame({'number':[1,2,3,234,2,3,2,2,1,2], 'color':['blue','blue','red','red','red','blue','blue','red','green','yellow']})
display(sample_df)

Q1. Create a Series named condition which is True for all rows where the color is red.
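
As a minimal sketch of how a comparison produces a boolean Series, the example below uses a small made-up DataFrame. The names toy_df, fruit, count and is_apple are hypothetical and are not part of the exercise.

```python
import pandas as pd

# Hypothetical toy data, unrelated to sample_df.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple"], "count": [3, 5, 2]})

# Comparing a column with a value is done element by element and
# gives a boolean Series with one entry per row.
is_apple = toy_df["fruit"] == "apple"
print(is_apple.tolist())  # [True, False, True]
```

The resulting Series has the same index as the DataFrame, which is what makes it usable for selecting rows later.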

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(condition)=='28612bf01a'

Q2. Create a new DataFrame named red_sample_df which contains only the rows with red color from sample_df.

Hint: use your condition Series from above.
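
A rough illustration of row selection with a boolean Series, again on made-up data (toy_df, is_apple and apple_df are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, unrelated to sample_df.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple"], "count": [3, 5, 2]})
is_apple = toy_df["fruit"] == "apple"

# Indexing with a boolean Series keeps only the rows where it is True.
# The result is a new DataFrame; toy_df itself is left unchanged.
apple_df = toy_df[is_apple]
print(len(apple_df), len(toy_df))  # 2 3
```

Note that the selected rows keep their original index labels (here 0 and 2).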

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(red_sample_df)=='354bd7ec89'
assert ads_hash(sample_df)=='21692a4d62'

Now let's use the DataFrame .groupby() method to find the mean value of number for each color. (Hint: this will be useful later in this exercise.)

In [ ]:
for color, gdf in sample_df.groupby('color'):
    mean_number = gdf["number"].mean()
    print(f"{color} {mean_number}")
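
The same per-color means can probably be obtained without an explicit loop. The sketch below rebuilds the fake data under the hypothetical name demo_df so that it runs on its own, then selects the number column after grouping and aggregates it in one call.

```python
import pandas as pd

# Same fake data as sample_df above, rebuilt so this block is self-contained.
demo_df = pd.DataFrame({
    "number": [1, 2, 3, 234, 2, 3, 2, 2, 1, 2],
    "color": ["blue", "blue", "red", "red", "red",
              "blue", "blue", "red", "green", "yellow"],
})

# Group by color, pick the 'number' column, and take the mean of each group.
# The result is a Series indexed by color.
mean_by_color = demo_df.groupby("color")["number"].mean()
print(mean_by_color)
```

The loop form is still useful when you need the whole group DataFrame (gdf) for something more involved than a single aggregation.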

Using Pandas to work with real data

Get the CSV file and upload it to Jupyter

  • Go to https://datadryad.org/stash/dataset/doi:10.5061/dryad.vb06d3k and download the data by clicking "Download Dataset".
    • Check that you now have a file called doi_10.5061_dryad.vb06d3k__v1.zip about 99KB in size in your Downloads.
    • Unzip this file, which should create:
      • spectra.csv
      • data_onevalueperbee.csv
      • data_visitationsequence.csv
    • We are going to focus on the file data_onevalueperbee.csv. Upload this file to your Jupyter server using the Upload button in the file view. Upload it into the same directory as this .ipynb file. Alternatively, if you are running this on your own computer, you can copy this .csv file into the directory with your .ipynb file.

Look at the file

As we have covered, CSV stands for "Comma separated values" and is one of the most widely used file formats for scientific data. CSV files are typically like tables, potentially with column names in the first line. The lines of the file are then rows of the table. Spreadsheet programs can open most CSV files, sometimes with a few settings being required to deal with various CSV dialects.

Let's take a look at this CSV file. This is mostly a repetition of what we recently looked at. We open the CSV file and read the first few lines.

In [ ]:
# Open the file in Python. The variable `fd` will be assigned the open file object.
fd = open('data_onevalueperbee.csv', mode="rb")

# Iterate over the lines in the file, also creating an integer called `line_number` via the
# use of the `enumerate()` function.
for (line_number, line) in enumerate(fd.readlines()):
    # Print each line.
    print(line)
    if line_number >= 3:
        # Stop after a few lines
        break
# Close the open file object.
fd.close()
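
An alternative way to write the cell above is with a `with` block, which closes the file automatically. This is only a sketch: it writes a tiny stand-in file (the hypothetical demo.csv in a temporary directory) so that it runs without the real data.

```python
import os
import tempfile

# Write a tiny stand-in file so this sketch runs without the real data.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.csv")
with open(demo_path, mode="wb") as fd:
    fd.write(b"a;b\n1;2\n3;4\n5;6\n7;8\n")

first_lines = []
# The `with` block closes the file automatically, even if an error occurs.
with open(demo_path, mode="rb") as fd:
    # Iterating over the file object yields one line at a time,
    # so the whole file does not have to be read into memory.
    for line_number, line in enumerate(fd):
        first_lines.append(line)
        if line_number >= 3:
            # Stop after a few lines
            break

print(first_lines)
```

Because the file is opened in binary mode ("rb"), each line is a bytes object, which is why the printed lines start with b'...'.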

Note that the columns are separated by a semicolon (;) rather than a comma (,). We therefore need to pass the sep=';' keyword argument when calling the Pandas read_csv() function. This CSV file also has some slightly unusual characters in it. From trial and error, I have learned that it must be opened with an encoding='latin_1' keyword argument to read_csv().
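
To see what these two keyword arguments do, here is a small sketch on made-up data held in memory rather than in a file. The contents of raw_bytes and the name demo_df are hypothetical.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated text containing a non-ASCII character,
# encoded as Latin-1 bytes to mimic what a file on disk would contain.
raw_bytes = "name;value\nLämsä;1\nSmith;2\n".encode("latin_1")

# sep=';' tells pandas where the columns split; encoding='latin_1'
# tells it how to turn the bytes back into text.
demo_df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin_1")
print(demo_df)
```

Without sep=';', pandas would look for commas and treat each line as a single column.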

Q3. Read the file data_onevalueperbee.csv into a DataFrame named df.

Hint: use the read_csv function not only with the filename as the required first positional argument, but also with the keyword arguments described above.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(df)=='15994f5b0e'
In [ ]:
# Let's look at the first few rows of the DataFrame with the pandas `.head()` method:

df.head()

Q4. How many rows of data are in the dataframe you read from the file data_onevalueperbee.csv? Put the answer in a variable num_rows.

Hint: you can use len(df) to calculate the number of rows in the DataFrame df.
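
For illustration, on a made-up DataFrame (toy_df is a hypothetical name), len() counts rows only, while the .shape attribute gives both rows and columns:

```python
import pandas as pd

# Hypothetical toy data with 3 rows and 2 columns.
toy_df = pd.DataFrame({"x": [10, 20, 30], "y": ["a", "b", "c"]})

n_rows = len(toy_df)   # number of rows
print(n_rows, toy_df.shape)  # 3 (3, 2)
```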

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_rows)=='ff2ccb6ba4'

Q5. What are the unique values of the nest column?

Put your answer in the variable unique_nests. Hint: use the .unique() method on the Series for the nest column.
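
A short sketch of .unique() on a made-up Series (toy_series is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy data with repeated values.
toy_series = pd.Series([3, 1, 3, 2, 1])

# .unique() returns a NumPy array of the distinct values,
# in order of first appearance (it does not sort them).
unique_values = toy_series.unique()
print(unique_values)  # [3 1 2]
```

If only the count of distinct values is needed, the related .nunique() method returns it directly.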

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_nests)=='59e69ce283'

Q6. What are the unique values of the treatment column?

Put your answer in the variable unique_treatments.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_treatments) == '670c49c25a'

Now let's check which nests are in each treatment:

In [ ]:
for treatment, gdf in df.groupby('treatment'):
    print(treatment)
    print(gdf['nest'].unique())
    print()

Q7. Now make a seaborn boxplot of the time to start foraging (tstartforaging in the CSV file) on the y axis, nest number on the x axis, and with the color (hue) being the treatment. Save the plot to a file named Figure2a.png.

Your plot should look like:

Check the left panel of Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper. Do you see any similarities or differences with your plot?

Hint: import seaborn and call its boxplot() function with data, x, y, hue keyword arguments with values df, nest, tstartforaging, treatment.
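
The general pattern of drawing a grouped boxplot and saving it can be sketched as follows. The data and the column names (group, condition, score) are made up for illustration, and the output goes to a temporary directory under the hypothetical file name toy_boxplot.png.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw without a display; not needed inside Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical toy data; the column names are made up for illustration.
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "a", "b"],
    "condition": ["x", "x", "y", "y", "x", "y", "y", "x"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 2.5, 4.5],
})

# boxplot() returns the matplotlib Axes it drew on.
ax = sns.boxplot(data=toy_df, x="group", y="score", hue="condition")

# Save the figure that owns those Axes to a PNG file.
out_path = os.path.join(tempfile.mkdtemp(), "toy_boxplot.png")
ax.figure.savefig(out_path)
```

Keeping the returned Axes in a variable is one way to reach the figure for saving; calling matplotlib's plt.savefig() right after plotting is another common option.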

In [ ]:
# Run this cell to import seaborn
import seaborn as sns
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q8. Now, in a single line, use seaborn's boxplot function to make a plot which shows the duration of the foraging period on the y axis.

This should be fairly similar to the second panel in Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper.

Hint: the duration of the foraging period is in the 'maxminusstarttime' column.

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Q9. Make a dictionary called mean_duration_of_foraging_by_treatment. Fill this with items whose keys are treatment strings ('control' and 'imidacloprid') and whose values are the mean of all 'maxminusstarttime' values for rows in which the treatment variable is the same as the key.

Hint: use groupby to group on the treatment column and use the .mean() method on the maxminusstarttime Series within each group data frame. Store each mean as the value in your dictionary for the key.
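
The pattern described in the hint might look like this on made-up data. The names toy_df, group, score and mean_score_by_group are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the bee dataset.
toy_df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "score": [1.0, 10.0, 3.0, 20.0],
})

mean_score_by_group = {}
for group, gdf in toy_df.groupby("group"):
    # gdf holds only the rows belonging to this group.
    # float() converts the NumPy scalar to a plain Python float.
    mean_score_by_group[group] = float(gdf["score"].mean())

print(mean_score_by_group)  # {'a': 2.0, 'b': 15.0}
```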

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash([(k,int(round(v))) for (k,v) in mean_duration_of_foraging_by_treatment.items()]) == '14fe49a61a'

Q10. Make a dictionary num_unique_bees_per_nest which has key-value pairs of nest number and the number of unique bees from that nest.

Hint: use .groupby on the 'nest' column.
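
As a sketch of counting distinct values within each group, here is an example on made-up data. The column names hive and tag are hypothetical and do not correspond to columns in the CSV file.

```python
import pandas as pd

# Hypothetical toy data: several observations per hive, with repeated tags.
toy_df = pd.DataFrame({
    "hive": [1, 1, 1, 2, 2, 2],
    "tag": ["r1", "r1", "r2", "g1", "g2", "g3"],
})

n_tags_per_hive = {}
for hive, gdf in toy_df.groupby("hive"):
    # .nunique() counts distinct values, so the repeated "r1" counts once.
    n_tags_per_hive[int(hive)] = gdf["tag"].nunique()

print(n_tags_per_hive)  # {1: 2, 2: 3}
```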

In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_unique_bees_per_nest)=='ddc47911c0'
</html>