pm21-dragon/exercises/source/exercise-05/2__reading_csv_files.ipynb
2024-11-14 16:57:13 +01:00

<html> <head> </head>

Reading CSV files

Step 1: Download the dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality by clicking the "Download" button, which gives you the file wine+quality.zip. Open this archive and extract winequality-red.csv. Place it in the same folder as this notebook.

Let's look at the first lines of this file.

In [1]:
# Print the first five lines (0 to 4); the with block closes the file afterwards.
with open('winequality-red.csv') as fobj:
    for line_num, line in enumerate(fobj):
        line = line.strip()
        print(f"line {line_num}: '{line}'")
        if line_num > 3:
            break
line 0: '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"'
line 1: '7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5'
line 2: '7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5'
line 3: '7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5'
line 4: '11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6'

Q10 Read the file into a dict called data

The dict should have a key for each column in the CSV file and each dictionary value should be a list with all the values in that column.

For example, a CSV file like this:

name,home planet
Arthur,Earth
Zaphod,Betelgeuse V
Trillian,Earth

Would result in a dictionary like this:

{'name':['Arthur','Zaphod','Trillian'], 'home planet':['Earth', 'Betelgeuse V', 'Earth']}
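
The small example above can be worked through in code. The following is a minimal sketch, assuming the CSV text is held in a string (the name csv_text is only illustrative) and that all values stay as strings:

```python
# The example CSV from above, held in a string instead of a file.
csv_text = """name,home planet
Arthur,Earth
Zaphod,Betelgeuse V
Trillian,Earth
"""

data = {}
column_names = []
for line_num, line in enumerate(csv_text.splitlines()):
    entries = line.split(",")
    if line_num == 0:
        # Header line: create one empty list per column name.
        column_names = entries
        for name in column_names:
            data[name] = []
        continue
    # Data line: append each entry to the list of its column.
    for name, entry in zip(column_names, entries):
        data[name].append(entry)

print(data)
# {'name': ['Arthur', 'Zaphod', 'Trillian'], 'home planet': ['Earth', 'Betelgeuse V', 'Earth']}
```

The header line is handled once to set up the keys; every later line only appends to the existing lists.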

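Here, however, read the file winequality-red.csv, which you placed in this folder in Step 1. Note that in this wine quality "CSV" file, the values are separated by semicolons (;), not commas. The tests below expect each column name to keep its surrounding double quotes, for example '"alcohol"'.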

In [2]:
data = {}
with open('winequality-red.csv') as fobj:
    for line_num, line in enumerate(fobj):
        entries = line.strip().split(';')
        if line_num == 0:
            # Header line: one empty list per column. The names keep
            # their surrounding double quotes, e.g. '"alcohol"'.
            column_names = entries
            for column_name in column_names:
                data[column_name] = []
            continue
        # Data line: convert each entry to float and append it to its column.
        for (colname, entry) in zip(column_names, entries):
            data[colname].append(float(entry))
data.keys()
Out[2]:
dict_keys(['"fixed acidity"', '"volatile acidity"', '"citric acid"', '"residual sugar"', '"chlorides"', '"free sulfur dioxide"', '"total sulfur dioxide"', '"density"', '"pH"', '"sulphates"', '"alcohol"', '"quality"'])
In [3]:
assert len(data.keys()) == 12
assert len(data['"alcohol"'])==1599
assert sum(data['"quality"']) == 9012
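
The keys in Out[2] keep their double quotes because str.split leaves quote characters in place. As an alternative, here is a sketch using the standard library csv module, which strips the quotes when given delimiter=';'. The helper name read_columns and the SAMPLE string (the first three lines of the file, copied from the output of In [1]) are illustrative assumptions. Note that the asserts in In [3] expect the quoted keys, so this is not a drop-in replacement for the cell above.

```python
import csv
import io

# The first three lines of winequality-red.csv, copied from the output of In [1].
SAMPLE = '''"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
'''


def read_columns(fobj, delimiter=";"):
    """Return a dict mapping each column name to a list of floats."""
    # csv.reader treats '"' as the quote character by default and removes it.
    reader = csv.reader(fobj, delimiter=delimiter)
    column_names = next(reader)
    columns = {name: [] for name in column_names}
    for row in reader:
        for name, entry in zip(column_names, row):
            columns[name].append(float(entry))
    return columns


columns = read_columns(io.StringIO(SAMPLE))
print(list(columns)[:3])   # ['fixed acidity', 'volatile acidity', 'citric acid']
print(columns["alcohol"])  # [9.4, 9.8]
```

For the real file, the same function could be called as read_columns(open('winequality-red.csv', newline='')) inside a with block.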

Q11 Plot the "Alcohol" (Y axis) versus "Density" (X axis).

Your plot should look like this:

In [4]:
import matplotlib.pyplot as plt
plt.plot(data['"density"'], data['"alcohol"'], '.')
plt.xlabel("Density")
plt.ylabel("Alcohol");

Q12 Make a Python program that does this

Create a Python program called plot_red_wine.py which makes the above plot (alcohol vs density for the red wine dataset) and saves the plot to a file called red_wine.png.

Hint: save the figure using the plt.savefig() function. (You might also want to play around with the plt.show() function.)
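
One possible layout for such a script is sketched below. It is not the reference solution. It assumes the same axis choice as cell In [4] (density on X, alcohol on Y), and the helper name parse_columns and the constants CSV_PATH and PNG_PATH are illustrative. Because it parses with the csv module, the column names here carry no surrounding quotes.

```python
"""plot_red_wine.py: plot alcohol against density for the red wine data."""
import csv
import os

CSV_PATH = "winequality-red.csv"  # input, expected next to this script
PNG_PATH = "red_wine.png"         # output written by plt.savefig()


def parse_columns(lines, delimiter=";"):
    """Return {column name: list of floats} from an iterable of text lines."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    columns = {name: [] for name in header}
    for row in reader:
        for name, entry in zip(header, row):
            columns[name].append(float(entry))
    return columns


def main():
    if not os.path.exists(CSV_PATH):
        print(f"{CSV_PATH} not found; place it next to this script.")
        return
    # Imported here so that parse_columns() can be used without matplotlib.
    import matplotlib.pyplot as plt

    with open(CSV_PATH, newline="") as fobj:
        data = parse_columns(fobj)
    plt.plot(data["density"], data["alcohol"], ".")
    plt.xlabel("Density")
    plt.ylabel("Alcohol")
    plt.savefig(PNG_PATH)  # write the figure to disk
    # plt.show() would additionally open a window; call it after savefig().


if __name__ == "__main__":
    main()
```

Running python plot_red_wine.py in the folder that contains the CSV file should then leave red_wine.png next to the script.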

Uploading the exercise

For this exercise, the following files should be uploaded:

  • The two .ipynb files (overwriting the original ones, as usual).
  • plot_red_wine.py - Your Python script
  • winequality-red.csv - The file you downloaded
  • red_wine.png - The plot you generated using plot_red_wine.py.
</html>