Andrew Straw de006a7475 exercise-07 release

2024-11-25 08:20:05 +01:00

164 KiB


None <html> <head> </head>

In [ ]:

# You must run this cell, but you can ignore its contents.

import hashlib

def ads_hash(ty):
    """Return a unique string for input"""
    ty_str = str(ty).encode()
    m = hashlib.sha256()
    m.update(ty_str)
    return m.hexdigest()[:10]

In [ ]:

import pandas as pd

Let's get started by making a sample dataframe with fake data:

In [ ]:

sample_df = pd.DataFrame({'number':[1,2,3,234,2,3,2,2,1,2], 'color':['blue','blue','red','red','red','blue','blue','red','green','yellow']})
display(sample_df)

Q1. Create a Series named `condition` which is true for all rows where the color is red

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(condition)=='28612bf01a'

Q2. Create a new `DataFrame` named `red_sample_df` which contains only the rows with red color from `sample_df`.

Hint: use your condition Series from above.
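As a minimal sketch of the idea behind this hint, the following uses a separate toy DataFrame. The names `toy_df`, `is_circle`, and `circle_df` are made up for illustration and are not part of the exercise. Comparing a column with a value gives a boolean Series, and indexing the DataFrame with that Series keeps only the rows where it is `True`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the exercise's sample_df.
toy_df = pd.DataFrame({
    "size": [1, 5, 3, 8],
    "shape": ["circle", "square", "circle", "triangle"],
})

# Comparing a column with a value yields a boolean Series (one True/False per row).
is_circle = toy_df["shape"] == "circle"

# Indexing with the boolean Series returns a new DataFrame with only the True rows.
# The original toy_df is left unchanged.
circle_df = toy_df[is_circle]

print(is_circle.tolist())
print(circle_df)
```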

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(red_sample_df)=='354bd7ec89'
assert ads_hash(sample_df)=='21692a4d62'

Now let's use the DataFrame .groupby() method to find the mean value of number for each color. (Hint: this will be useful later in this exercise.)

In [ ]:

for color, gdf in sample_df.groupby('color'):
    mean_number = gdf["number"].mean()
    print(f"{color} {mean_number}")

Using Pandas to work with real data

Get the CSV file and upload it to Jupyter

Go to https://datadryad.org/stash/dataset/doi:10.5061/dryad.vb06d3k and download the data by clicking "Download Dataset".
- Check that you now have a file called doi_10.5061_dryad.vb06d3k__v1.zip, about 99KB in size, in your Downloads.
- Unzip this file, which should create:
  - spectra.csv
  - data_onevalueperbee.csv
  - data_visitationsequence.csv
- We are going to focus on the file data_onevalueperbee.csv. Upload this file to your Jupyter server using the Upload button in the file view. Upload it into the same directory as this .ipynb file. Alternatively, if you are running this on your own computer, you can copy this .csv file into the directory with your .ipynb file.

Look at the file

As we have covered, CSV stands for "Comma separated values" and is one of the most widely used file formats for scientific data. CSV files are typically like tables, potentially with column names in the first line. The lines of the file are then rows of the table. Spreadsheet programs can open most CSV files, sometimes with a few settings being required to deal with various CSV dialects.

Let's take a look at this CSV file. This is mostly a repetition of what we recently looked at. We open the CSV file and read the first few lines.

In [ ]:

# Open the file in Python. The variable `fd` will be assigned the open file object.
fd = open('data_onevalueperbee.csv', mode="rb")

# Iterate over the lines in the file, also creating an integer called `line_number` via the
# use of the `enumerate()` function.
for (line_number, line) in enumerate(fd.readlines()):
    # Print each line.
    print(line)
    if line_number >= 3:
        # Stop after a few lines
        break
# Close the open file object.
fd.close()

Note that the columns are separated by a semicolon (;) rather than a comma (,). We therefore need to pass the sep=';' keyword argument when calling the Pandas read_csv() function. This CSV file also has some slightly unusual characters in it. From trial and error, I have learned that it must be opened with an encoding='latin_1' keyword argument to read_csv().
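To illustrate what these two keyword arguments do, here is a small sketch that reads made-up data from memory rather than from the exercise file. The bytes in `raw_bytes`, the column names, and the variable `toy_csv_df` are all assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory "file": semicolon-separated text encoded as Latin-1.
raw_bytes = "name;value\nLämsä;1.5\nMüller;2.0\n".encode("latin_1")

# sep=';' tells pandas which character separates the columns.
# encoding='latin_1' tells pandas how to decode the bytes into text.
toy_csv_df = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin_1")

print(toy_csv_df)
```

Without `sep=';'`, pandas would split on commas and would find only a single column in this toy data.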

Q3. Read the file `data_onevalueperbee.csv` into a dataframe named `df`.

Hint: use the read_csv function not only with the filename as the required first positional argument, but also with the keyword arguments described above.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(df)=='15994f5b0e'

In [ ]:

# Let's look at the first few lines of the file with the pandas DataFrame `.head()` method:

df.head()

Q4. How many rows of data are in the dataframe you read from the file `data_onevalueperbee.csv`? Put the answer in a variable `num_rows`.

Hint: you can use len(df) to calculate the number of rows in the DataFrame df.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_rows)=='ff2ccb6ba4'

Q5. What are the unique values of the `nest` column?

Put your answer in the variable unique_nests. Hint: use the .unique() method on the Series for the nest column.
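As a rough sketch of how `.unique()` behaves, the following uses a toy Series. The column name `fruit` and the variable `unique_fruits` are hypothetical and not part of the exercise data.

```python
import pandas as pd

# Hypothetical toy data with repeated values.
toy_df = pd.DataFrame({"fruit": ["apple", "pear", "apple", "plum", "pear"]})

# .unique() returns each distinct value once, in order of first appearance.
unique_fruits = toy_df["fruit"].unique()

print(unique_fruits)
```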

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_nests)=='59e69ce283'

Q6. What are the unique values of the `treatment` column?

Put your answer in the variable unique_treatments.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(unique_treatments) == '670c49c25a'

Now let's check which nests are in each treatment:

In [ ]:

for treatment, gdf in df.groupby('treatment'):
    print(treatment)
    print(gdf['nest'].unique())
    print()

Q7. Now make a `seaborn` `boxplot` of the time to start foraging (`tstartforaging` in the CSV file) on the `y` axis, `nest` number on the `x` axis, and with the color (`hue`) being the `treatment`. Save the plot to a file named `Figure2a.png`.

Your plot should look like:

Check the left panel of Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper. Do you see any similarities or differences with your plot?

Hint: import seaborn and call its boxplot() function with data, x, y, hue keyword arguments with values df, nest, tstartforaging, treatment.
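The general shape of such a call can be sketched on toy data. Everything named here (`toy_df`, its columns `group`, `score`, and `condition`, and the output file `toy_boxplot.png`) is made up for illustration. The `matplotlib.use("Agg")` line is only there so that the sketch also runs as a plain script without a display. In a notebook you would normally leave it out.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumption for running outside a notebook

import pandas as pd
import seaborn as sns

# Hypothetical toy data: two groups, each measured under two conditions.
toy_df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b", "a", "b"],
    "score": [1.0, 2.0, 3.0, 2.5, 3.5, 4.5, 2.2, 3.1],
    "condition": ["x", "y", "x", "y", "x", "y", "y", "x"],
})

# boxplot() takes the DataFrame via `data` and column names for x, y, and hue.
ax = sns.boxplot(data=toy_df, x="group", y="score", hue="condition")

# boxplot() returns a matplotlib Axes; its parent figure can be saved to a file.
ax.figure.savefig("toy_boxplot.png")
```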

In [ ]:

# Run this cell to import seaborn
import seaborn as sns

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

Q8. Now make a plot, with a single line of code using seaborn's `boxplot` function, which shows the duration of the foraging period on the y axis.

This should be fairly similar to the second panel in Figure 2a from the Lämsä et al. Proc. Roy Soc B. 2018 paper.

Hint: the duration of the foraging period is in the 'maxminusstarttime' column.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

Q9. Make a dictionary called `mean_duration_of_foraging_by_treatment`. Fill this with items whose keys are treatment strings (`'control'` and `'imidacloprid'`) and whose values are the mean of all `'maxminusstarttime'` values for rows in which the `treatment` variable is the same as the key.

Hint: use groupby to group on the treatment column and use the .mean() method on the maxminusstarttime Series within each group data frame. Store each mean as the value in your dictionary for the key.
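The pattern in this hint can be sketched on toy data as follows. The DataFrame `toy_df`, its columns `team` and `points`, and the dictionary `mean_points_by_team` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy data: two teams with two scores each.
toy_df = pd.DataFrame({
    "team": ["north", "south", "north", "south"],
    "points": [10, 20, 30, 40],
})

# Start with an empty dictionary and fill it one group at a time.
mean_points_by_team = {}
for team, gdf in toy_df.groupby("team"):
    # `team` is the group key; `gdf` is the DataFrame of rows for that key.
    mean_points_by_team[team] = gdf["points"].mean()

print(mean_points_by_team)
```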

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash([(k,int(round(v))) for (k,v) in mean_duration_of_foraging_by_treatment.items()]) == '14fe49a61a'

Q10. Make a dictionary `num_unique_bees_per_nest` which has key-value pairs of nest number and the number of unique bees from that nest.

Hint: use .groupby on the 'nest' column.

In [ ]:

# YOUR CODE HERE
raise NotImplementedError()

In [ ]:

# If this runs without error, it means the answer in your previous cell was correct.
assert ads_hash(num_unique_bees_per_nest)=='ddc47911c0'

</html>
