tutorials/python-programming-3.qmd
This chapter will use data from the Varieties of Democracy (VDEM) dataset. VDEM is an ongoing research project to measure the level of democracy in governments around the world, with updated versions of the dataset released regularly. The research is led by a team of over 50 social scientists who coordinate the collection and analysis of expert assessments from over 3,200 historians and Country Experts (CEs). From these assessments, the VDEM project has created a remarkably complex array of indicators designed to align with five high-level facets of democracy: electoral, liberal, participatory, deliberative, and egalitarian. The dataset extends back to 1789 and is considered the gold standard of quantitative data about global democratic developments. I strongly recommend that you download the codebook and consult it as you work with this data. You can find the full dataset at https://www.v-dem.net/en/data/data/v-dem-dataset-v11/ and the codebook at https://www.v-dem.net/media/filer_public/e0/7f/e07f672b-b91e-4e98-b9a3-78f8cd4de696/v-dem_codebook_v8.pdf. The filtered and subsetted version we will use in this book is provided in the `data/vdem` directory of the online learning materials.
> Download the Data
>
> We'll download the data using wget. Run the following command in your terminal:
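Once the file is downloaded, it can be loaded with `read_csv()`. The sketch below is illustrative only: it reads from an in-memory string rather than the real file (whose exact name isn't reproduced here), and the columns are a made-up miniature of the VDEM data.

```{python}
import io

import pandas as pd

# Stand-in for a file in data/vdem/ -- the real dataset has far more
# columns and rows; these values are invented for illustration.
csv_text = """country_name,year,v2x_polyarchy
Sweden,2010,0.91
Sweden,2011,0.92
Hungary,2011,0.70
"""

# In practice you would pass a path such as pd.read_csv("data/vdem/<file>");
# read_csv also accepts any file-like object, which we use here.
vdem_df = pd.read_csv(io.StringIO(csv_text))

vdem_df.dtypes
```

Note that `read_csv()` infers a dtype for each column: the country names come back as `object`, the years as `int64`, and the index values as `float64`.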
Unlike the VDEM data, the Russian Troll Tweets come as a collection of `csv` files. We will use a clever little trick to load all of the data into a single dataframe. The code block below iterates over each file in the `russian-troll-tweets/` subdirectory of the data directory. If the file extension is `csv`, it reads the `csv` into memory as a dataframe. All of the dataframes are then concatenated into a single dataframe containing data on roughly 3 million tweets.
> Download the Data
>
> Once again, we'll download the data using wget. Run the following command in your terminal:
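The loading pattern described above — iterate over the directory, read each `csv`, concatenate — can be sketched as follows. To keep the example self-contained, it builds a throwaway directory with two tiny CSVs rather than using the real `russian-troll-tweets/` data:

```{python}
import tempfile
from pathlib import Path

import pandas as pd

# Throwaway stand-in for the russian-troll-tweets/ directory.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "tweets_1.csv").write_text("author,content\nA,hello\nB,world\n")
(data_dir / "tweets_2.csv").write_text("author,content\nC,again\n")
(data_dir / "README.txt").write_text("not a csv, so it should be skipped")

# Read every csv in the directory into a dataframe, then concatenate
# them all into one dataframe.
frames = [
    pd.read_csv(path)
    for path in sorted(data_dir.iterdir())
    if path.suffix == ".csv"
]
tweets_df = pd.concat(frames, ignore_index=True)

tweets_df
```

Passing `ignore_index=True` rebuilds the row index from 0, so rows from different files don't carry duplicate index labels into the combined dataframe.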

> There are more than 2.9M tweets in this dataset, and 21 columns. Let's run this code with a 10% random sample. You can increase (or decrease) the sample size if you wish.
>
> ```{python}
> sample_size = 0.1
> tweets_df = tweets_df.sample(frac=sample_size)
> ```
```{python}
#| echo: false
sample_size = 0.1
tweets_df = tweets_df.sample(frac=sample_size)
```
As you can see, we have two datatypes in our dataframe: `object` and `int64`. Remember that Pandas uses `object` for columns that contain strings, or that contain mixed types, such as strings and integers. In this case, the `object` columns contain strings.
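You can confirm how pandas assigns these two dtypes with a toy dataframe (the column names below are made up for illustration, not taken from the real dataset):

```{python}
import pandas as pd

# One string column and one integer column.
toy_df = pd.DataFrame({"author": ["A", "B"], "followers": [100, 50]})

# The string column is reported as object, the integer column as int64.
toy_df.dtypes
```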
One further thing to note about this dataset: each row is a tweet from a specific account, but some of the variables describe attributes of the tweeting accounts, not of the tweet itself. For example, `followers` describes the number of followers that the account had at the time it sent the tweet. This makes sense, because tweets don't have followers, but accounts do. We need to keep this in mind when working with this dataset.
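One practical consequence: if you want account-level statistics, you should first collapse the tweet-level rows to one row per account; otherwise prolific accounts are over-counted. A minimal sketch with invented values (only the shape of the data mirrors the real dataset):

```{python}
import pandas as pd

# Tweet-level rows: the account-level attribute `followers` repeats
# on every tweet an account sends.
tweets = pd.DataFrame({
    "author": ["A", "A", "A", "B"],
    "followers": [100, 100, 100, 50],
    "content": ["t1", "t2", "t3", "t4"],
})

# A naive mean over tweets over-weights prolific account A...
per_tweet_mean = tweets["followers"].mean()      # (100*3 + 50) / 4 = 87.5

# ...so collapse to one row per account first.
accounts = tweets.drop_duplicates(subset="author")[["author", "followers"]]
per_account_mean = accounts["followers"].mean()  # (100 + 50) / 2 = 75.0
```

This sketch assumes `followers` is constant within an account; in the real data it changes over time, so you would need to decide which snapshot to keep (for example, the most recent).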
An alternative way to combine datasets is to **merge** them. If you want to create a dataframe that contains columns from multiple datasets but is aligned on rows according to some column (or set of columns), you probably want to use the `merge()` function. To illustrate this, we will work with data from two different sources. The first is the VDEM data we used in the first part of this chapter (`fsdf`). The second is a dataset from Freedom House on levels of internet freedom in 65 countries. More information is available at https://freedomhouse.org/countries/freedom-net/scores.
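Before turning to the real data, here is what a merge looks like in miniature. The frames and values below are invented; only the pattern — aligning rows on a shared country column — matches what we will do with the VDEM and Freedom House data:

```{python}
import pandas as pd

# Miniature stand-ins for the two sources.
vdem_mini = pd.DataFrame({
    "country": ["Iceland", "Chad", "Estonia"],
    "polyarchy": [0.90, 0.20, 0.85],
})
fh_mini = pd.DataFrame({
    "country": ["Iceland", "Estonia"],
    "internet_freedom": [95, 93],
})

# merge() aligns rows on the shared "country" column; how="inner" keeps
# only countries present in both frames, so Chad is dropped here.
merged = pd.merge(vdem_mini, fh_mini, on="country", how="inner")

merged
```

Changing `how` to `"left"` would instead keep every row of `vdem_mini`, filling `internet_freedom` with `NaN` for countries missing from the Freedom House frame.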