# You must run this cell, but you can ignore its contents.
import hashlib
def ads_hash(ty):
    """Return a unique string for input"""
    ty_str = str(ty).encode()
    m = hashlib.sha256()
    m.update(ty_str)
    return m.hexdigest()[:10]
import pandas as pd
Let's get started by making a sample dataframe with fake data:
sample_df = pd.DataFrame({'number':[1,2,3,234,2,3,2,2,1,2], 'color':['blue','blue','red','red','red','blue','blue','red','green','yellow']})
display(sample_df)
Q1. Create a Series named condition which is true for all rows where the color is red.
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(condition)=='28612bf01a'
Q2. Create a new DataFrame named red_sample_df which contains only the rows with red color from sample_df.
Hint: use your condition Series from above.
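The boolean-mask technique can be sketched on separate toy data. The frame `toy_df` and its columns `fruit` and `count` are made up for this illustration and are not part of the exercise:

```python
import pandas as pd

# Hypothetical toy data, unrelated to sample_df.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "kiwi"],
                       "count": [3, 5, 2, 7]})

# Comparing a column with a value gives a boolean Series, one entry per row.
is_apple = toy_df["fruit"] == "apple"

# Indexing with that Series keeps only the rows where it is True.
# The original frame is left unchanged.
apple_df = toy_df[is_apple]
print(apple_df)
```

The same pattern should carry over to any column and any comparison.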
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(red_sample_df)=='354bd7ec89'
assert ads_hash(sample_df)=='21692a4d62'
Now let's use the DataFrame .groupby() method to find the mean value of number for each color. (Hint: this will be useful later in this exercise.)
for color, gdf in sample_df.groupby('color'):
    mean_number = gdf["number"].mean()
    print(f"{color} {mean_number}")
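For comparison, here is a minimal sketch of the same kind of per-group mean computed without an explicit loop. It uses a made-up frame (`toy_df`, with hypothetical columns `fruit` and `count`) rather than the exercise data:

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "pear"],
                       "count": [2, 5, 4, 7]})

# Select the column after grouping, then aggregate: one mean per group.
mean_by_fruit = toy_df.groupby("fruit")["count"].mean()
print(mean_by_fruit)
```

The result is a Series indexed by the group key. The loop form above is often easier to follow when each group needs more than one step of processing.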
Using Pandas to work with real data
Get the CSV file and upload it to Jupyter
- Go to https://datadryad.org/stash/dataset/doi:10.5061/dryad.vb06d3k and download the data by clicking "Download Dataset".
- Check that you now have a file called doi_10.5061_dryad.vb06d3k__v1.zip, about 99 KB in size, in your Downloads folder.
- Unzip this file, which should create:
  - spectra.csv
  - data_onevalueperbee.csv
  - data_visitationsequence.csv
- We are going to focus on the file data_onevalueperbee.csv. Upload this file to your Jupyter server using the Upload button in the file view. Upload it into the same directory as this .ipynb file. Alternatively, if you are running this on your own computer, you can copy this .csv file into the directory with your .ipynb file.
Look at the file
As we have covered, CSV stands for "Comma separated values" and is one of the most widely used file formats for scientific data. CSV files are typically like tables, potentially with column names in the first line. The lines of the file are then rows of the table. Spreadsheet programs can open most CSV files, sometimes with a few settings being required to deal with various CSV dialects.
Let's take a look at this CSV file. This is mostly a repetition of what we recently looked at. We open the CSV file and read the first few lines.
# Open the file in Python. The variable `fd` will be assigned the open file object.
fd = open('data_onevalueperbee.csv', mode="rb")
# Iterate over the lines in the file, also creating an integer called `line_number` via the
# use of the `enumerate()` function.
for (line_number, line) in enumerate(fd.readlines()):
    # Print each line.
    print(line)
    if line_number >= 3:
        # Stop after a few lines
        break
# Close the open file object.
fd.close()
Note that instead of a comma (,) separating the columns, this file uses a semicolon (;). We therefore need to pass the sep=';' keyword argument when calling the Pandas read_csv() function. This CSV file also contains some slightly unusual characters. From trial and error, I have learned that it must be opened with an encoding='latin_1' keyword argument to read_csv().
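To see what these two keyword arguments do, here is a small self-contained sketch. The file contents and the column names `name` and `value` are invented for illustration, and an in-memory buffer stands in for a file on disk:

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-separated, with a non-ASCII
# character ("ä") stored in the Latin-1 encoding.
raw_bytes = "name;value\nLämsä;1.5\nSmith;2.0\n".encode("latin_1")

# io.BytesIO stands in for a file on disk so the sketch is self-contained.
toy_df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin_1")
print(toy_df)
```

Without sep=';' each line would be read as a single column, and without the encoding argument pandas would assume UTF-8 and could fail on the Latin-1 byte.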
Q3. Read the file data_onevalueperbee.csv into a dataframe named df.
Hint: use the read_csv function not only with the filename as the required first positional argument, but also with the keyword arguments described above.
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(df)=='15994f5b0e'
# Let's look at the first few lines of the file with the pandas DataFrame `.head()` method:
df.head()
Q4. How many rows of data are in the dataframe you read from the file data_onevalueperbee.csv? Put the answer in a variable num_rows.
Hint: you can use len(df) to calculate the number of rows in the DataFrame df.
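As a quick illustration on made-up data (`toy_df` is hypothetical), `len()` and the `.shape` attribute should agree on the row count:

```python
import pandas as pd

# Hypothetical toy data with three rows and two columns.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# len() of a DataFrame is its number of rows.
n_rows = len(toy_df)

# .shape is a (rows, columns) tuple.
n_rows_from_shape, n_cols = toy_df.shape
print(n_rows, n_rows_from_shape, n_cols)
```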
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_rows)=='ff2ccb6ba4'
Q5. What are the unique values of the nest column?
Put your answer in the variable unique_nests. Hint: use the .unique() method on the Series for the nest column.
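A minimal sketch of what `.unique()` returns, using a made-up Series (`toy_series` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with repeated values.
toy_series = pd.Series([3, 1, 3, 2, 1])

# .unique() returns a NumPy array of the distinct values,
# in order of first appearance (it does not sort them).
unique_values = toy_series.unique()
print(unique_values)

# .nunique() returns just the number of distinct values.
n_unique = toy_series.nunique()
print(n_unique)
```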
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_nests)=='59e69ce283'
Q6. What are the unique values of the treatment column?
Put your answer in the variable unique_treatments.
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_treatments) == '670c49c25a'
Now let's check which nests are in each treatment:
for treatment, gdf in df.groupby('treatment'):
    print(treatment)
    print(gdf['nest'].unique())
    print()
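The same kind of per-group listing can also be written without a loop. The sketch below uses invented data (`toy_df`, with hypothetical columns `group` and `site`) to show the idea:

```python
import pandas as pd

# Hypothetical toy data: each row belongs to a group and a site.
toy_df = pd.DataFrame({"group": ["a", "a", "a", "b", "b"],
                       "site": [1, 2, 1, 3, 3]})

# One array of distinct site values per group, indexed by group.
sites_by_group = toy_df.groupby("group")["site"].unique()
print(sites_by_group)
```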
Q7. Now make a seaborn boxplot of the time to start foraging (tstartforaging in the CSV file) on the y axis, nest number on the x axis, and with the color (hue) being the treatment. Save the plot to a file named Figure2a.png.
Your plot should look like:
Check the left panel of Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper. Do you see any similarities or differences with your plot?
Hint: import seaborn and call its boxplot() function with data, x, y, hue keyword arguments with values df, nest, tstartforaging, treatment.
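The general shape of such a call, including saving the figure, might look like the sketch below. It uses invented data (`toy_df` with hypothetical columns `site`, `group`, and `value`) and a throwaway output path, and it assumes seaborn and matplotlib are installed:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a plain script; omit in Jupyter
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two sites, two groups, a numeric measurement.
toy_df = pd.DataFrame({
    "site": [1, 1, 1, 1, 2, 2, 2, 2],
    "group": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "value": [1.0, 2.0, 3.0, 5.0, 2.0, 2.5, 4.0, 6.0],
})

# boxplot() draws onto a matplotlib Axes and returns it.
ax = sns.boxplot(data=toy_df, x="site", y="value", hue="group")

# Save through the Figure that owns the Axes (hypothetical file name).
out_path = os.path.join(tempfile.mkdtemp(), "toy_boxplot.png")
ax.figure.savefig(out_path)
```

Saving via `ax.figure.savefig()` is one option. Calling `matplotlib.pyplot.savefig()` in the same cell as the plot is another common choice.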
# Run this cell to import seaborn
import seaborn as sns
# YOUR CODE HERE
raise NotImplementedError()
Q8. Now make a plot with a single line of code, using seaborn's boxplot function, which shows the duration of the foraging period on the y axis.
This should be fairly similar to the second panel in Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper.
Hint: the duration of the foraging period is in the 'maxminusstarttime' column.
# YOUR CODE HERE
raise NotImplementedError()
Q9. Make a dictionary called mean_duration_of_foraging_by_treatment. Fill this with items whose keys are treatment strings ('control' and 'imidacloprid') and whose values are the mean of all 'maxminusstarttime' values for rows in which the treatment variable is the same as the key.
Hint: use groupby to group on the treatment column and use the .mean() method on the maxminusstarttime Series within each group data frame. Store each mean as the value in your dictionary for the key.
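The pattern described in the hint can be sketched on invented data. Here `toy_df`, its columns `group` and `value`, and the dictionary name are all hypothetical:

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"group": ["a", "b", "a", "b"],
                       "value": [1.0, 10.0, 3.0, 20.0]})

# Start with an empty dictionary and add one entry per group.
mean_by_group = {}
for group, gdf in toy_df.groupby("group"):
    # gdf is the sub-frame holding only this group's rows.
    mean_by_group[group] = gdf["value"].mean()

print(mean_by_group)
```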
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash([(k,int(round(v))) for (k,v) in mean_duration_of_foraging_by_treatment.items()]) == '14fe49a61a'
Q10. Make a dictionary num_unique_bees_per_nest which has key-value pairs of nest number and the number of unique bees from that nest.
Hint: use .groupby on the 'nest' column.
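Counting distinct values within each group follows the same loop pattern. The sketch below is on invented data: `toy_df` and its columns `site` and `animal_id` are hypothetical, and the real column names in the CSV file may differ:

```python
import pandas as pd

# Hypothetical toy data: repeated observations of animals at two sites.
toy_df = pd.DataFrame({"site": [1, 1, 1, 2, 2],
                       "animal_id": ["x", "x", "y", "z", "z"]})

n_unique_by_site = {}
for site, gdf in toy_df.groupby("site"):
    # .nunique() counts the distinct values in this group's column.
    n_unique_by_site[site] = gdf["animal_id"].nunique()

print(n_unique_by_site)
```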
# YOUR CODE HERE
raise NotImplementedError()
# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_unique_bees_per_nest)=='ddc47911c0'