Councils in Action

An exploration of data made available by Council Data Project

Author

Eva Maxfield Brown

Published

October 11, 2022


Abstract

Large-scale comparative research into municipal governance is often prohibitively difficult due to a lack of high-quality data. Recent advances in speech-to-text algorithms and natural language processing techniques have made it possible to more easily collect and analyze this type of data. In this paper, we introduce an open-source platform, the Council Data Project (CDP), to curate novel datasets for research into municipal governance. The contribution of this work is two-fold: 1. We demonstrate that CDP, as an infrastructure, can be used to assemble reliable comparative data across municipalities; 2. We provide exploratory analysis to show how CDP data can be used to gain insight into how municipal governments perform over time. We conclude by describing future directions for research on and with CDP, ranging from the development of machine learning models for speaker annotation, outline generation, and named entity recognition to the linking of data for large-scale comparative research.

Quarto

This document explains what Council Data Project (CDP) is and how the data generated and stored by CDP instances make large-scale computational social and political science studies possible at the municipal level. It is a slightly modified, up-to-date, interactive version of our ASIS&T AM22 conference paper [1].

Brown, Eva Maxfield, and Weber, Nicholas (2022). Councils in action: Automating the curation of municipal governance data for research. 10.48550/ARXIV.2204.09110.

If you want to skip to the data gathering, visualization, and plotting code, jump to #the-councils-in-action-dataset.

Introduction

Federalism, where power devolves to states and municipalities, is a defining feature of US democracy. But federalism also poses substantial challenges to the study of state and local governments: each state and each city or county in the USA has its own rules and regulations that structure a mode of governance. For this reason, political science research into local government is often referred to as a “black hole” [2]. It can be an extended exercise in data collection to determine the identity of elected officials across the 89,000 local governments in the USA, let alone measure and compare the performance of one system of local governance to another [3].

Sapotichne, J., Jones, B.D., and Wolfe, M. (2007). Is urban politics a black hole? Analyzing the boundary between political science and urban politics. Urban Affairs Review 43, 76–106.

Sumner, J.L., Farris, E.M., and Holman, M.R. (2020). Crowdsourcing reliable local data. Political Analysis 28, 244–262.

Alex Aichinger (2009). Open meeting laws and freedom of speech.

Ferreira, F., and Gyourko, J. (2014). Does gender matter for political leadership? The case of US mayors. Journal of Public Economics 112, 24–39.

Despite this challenge, there are common legal requirements for transparency in state and local government legislatures that can improve the quality, scale, and usefulness of data used to study local governance. Open meeting laws, also called sunshine laws, require that most meetings of state and local governments be open to the public, along with their decisions and records [4]. This means, in theory, that recordings of a legislative session, bills and materials supporting the proposal of a bill, and voting outcomes are required to be publicly accessible. In practice, access to recordings and voting records is often hampered by closed, proprietary information systems that make it difficult for citizens and researchers alike to navigate, search, and extract meaningful data [5].

Over the last two years, the Council Data Project (CDP) has developed an open-source software platform that significantly improves access to and engagement with local government data [6]. In this paper we provide a brief background of CDP and explain how this open-source platform enables large-scale, longitudinal, comparative research into municipal governance. In doing so, we position our work amongst other large-scale, public interest technology projects, and describe how CDP uniquely enables the assembly of comparative data for analysis. We then use CDP to assemble a longitudinal dataset from three different municipal councils (Seattle City Council, Portland City Council, and King County Council). We use this dataset to perform exploratory data analysis through the development of an N-gram plotting tool. Finally, we conclude by describing our own future work and the future work made possible by CDP and a growing data collection that we refer to as “Councils in Action.”

Council Data Project data now covers many more cities and counties. As this is a generated document, we will not limit ourselves to just examining Seattle City Council, Portland City Council, and King County Council. Rather, all plots and analyses will be updated each week with the latest data as we re-generate this article.

Background

Public Interest Technology for Municipal Events

A number of previous civic technology applications have created valuable and accessible local government data from publicly available information, but none have specifically focused on aggregating transcripts of legislative discussion. In the following section we review four of these applications, and note how they relate to but differ from CDP. Councilmatic was one of the first public interest technology projects to focus on making council information more accessible. Councilmatic is a system for processing and archiving past municipal council meetings and legislative information and tracking upcoming meetings. Councilmatic is entirely open-source and there are now working examples of this application being used in cities throughout the USA, including Los Angeles, Chicago, New York City, and Oakland [7]. However, each instance of Councilmatic, at a city or county level, is entirely separate from every other instance in its setup and maintenance. This distributed architecture makes it difficult to collaboratively develop new features, and prohibits cross-municipality data aggregation.

Poe, M., and Gregg, F. (2015). City council legislative subscription service.

Roy, D. (2020). Local voices network: Bringing under-heard community voices, perspectives and stories forward for healthier public dialogue. In 2020 Conference of the American Association for Applied Linguistics (AAAL).

Local Voices Network (LVN), a project from Cortico AI, provides a powerful platform for generating, visualizing, and searching through transcripts of civic “conversations.” LVN tools are highly targeted at reaching out to communities, facilitating a small-group conversation, and making such conversations easy to digest via machine learning analysis and visualization [8]. Such facilitation of community conversations engineers a discussion about specific topics and places them into a standardized format rather than curating existing data into a standard form. LVN produces novel insight into community sentiment but at the high cost of facilitation and engineering.

Big Local News (BLN), from researchers at Stanford University, has created a platform to obtain data, through web-scraping, from municipal councils across the country. Each scraper from BLN collects meeting assets such as documents, presentations, videos, and captions. However, there is currently no processing of the scraped data (e.g. transcript generation) that would expand the usage of these documents to include council discussions for analysis.

Blockparty is an emergent civic technology project which generates and analyzes meeting transcripts from New York City community council meetings. Blockparty creates and processes meeting transcripts to extract keywords and other potential highlights, and then publishes both the transcript and a keyword histogram to the web. Blockparty currently only serves New York City and without open-source code we are limited in understanding their deployment and data access mechanisms. Specifically, we do not know how researchers and civic hackers alike might deploy Blockparty for their own municipality, and more importantly, there is no structured open data produced by this project that can be analyzed for research purposes.

Lastly, while many projects focus on collecting and republishing municipal meeting data (from documents, videos, and in some cases transcripts), it is rare for a project to address data collection, aggregation, transformation, and linkage together. For example, of the projects reviewed above, Councilmatic is the only one that also stores legislative outcomes and voting records. Storing such data is valuable because it allows discussion to be investigated alongside the legislative end result.

Council Data Project

Council Data Project (CDP) attempts to improve upon state-of-the-art public interest technology projects by providing a low-cost, flexible, and scalable open-source solution for generating large, standardized corpora of municipal council meetings. CDP can be deployed in any municipality with minimal configuration by a developer or city IT department. Using a simple Python-based ‘cookie-cutter template’ [9], a developer can configure a new CDP deployment with just a two-line installation process and fully deploy the instance once provided a function to gather events [6]. Once fully deployed, an instance of CDP will collect and process municipal meeting minutes, agendas, voting records, and crucially, the recorded video or audio of each meeting that is archived by a municipality. For every legislative meeting that CDP processes, the system generates a transcript from the provided video using a machine learning model that converts recorded audio to text (aka speech-to-text processing). On a continuous schedule, CDP infrastructure then uses this corpus of transcripts to generate and update a keyword-based index to enable search of meetings by keyword [6].
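To make the idea of a keyword-based index concrete, the following is a minimal sketch of an inverted index built with the Python standard library. It is an illustration only and not CDP's actual indexing pipeline: the function names, the meeting ids, and the transcript text are all made up for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_keyword_index(transcripts: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the ids of meetings whose transcript contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for meeting_id, text in transcripts.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(meeting_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of meetings that contain every word in the query."""
    words = re.findall(r"[a-z']+", query.lower())
    if not words:
        return set()
    # Start from the first word's meetings, then intersect with the rest
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical transcripts keyed by made-up meeting ids
transcripts = {
    "meeting-a": "The committee discussed the police budget.",
    "meeting-b": "Public comment focused on housing and the budget.",
}
index = build_keyword_index(transcripts)
matches = search(index, "police budget")
```

Because the index maps words to meetings rather than meetings to words, a query only touches the entries for its own keywords, which is what makes regenerating and querying the index on a schedule practical as the corpus grows.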

Audrey Roy Greenfeld, C.C. A command-line utility that creates projects from cookiecutters (project templates), e.g. Creating a python package project from a python package project template.

Brown, E.M., Huynh, T., Na, I., Ledbetter, B., Ticehurst, H., Liu, S., Gilles, E., Cho, S., Ragoler, S., Weber, N., et al. (2021). Council data project: Software for municipal data collection, analysis, and publication. Journal of Open Source Software 6, 3904.

To make CDP data more accessible to the public, and to provide an interface for building a deeper, contextual understanding, each CDP infrastructure also publishes a website (see Figure 1). Users of CDP websites can search for municipal meetings using the keyword index and then retrieve the video of a meeting, the sentence time-stamped transcript (for reading and for jumping playback of video to a sentence start point), access the full minutes of the meeting, and view votes that took place during a meeting. Further, because CDP transforms municipal government data into a database specification, CDP also aggregates and publishes aggregate data - such as the entire voting records of city council members, or document timestamps that show when specific actions were taken on a piece of municipal legislation.

Figure 1: Event search results for query: “defund the police” using the Seattle City Council CDP instance website. The search page includes event ‘cards’ which contain a thumbnail from the meeting video, the meeting date, the meeting committee name, a snippet from the meeting transcript that contains and highlights one or more of the keyword search terms which were found in the transcript, and the meeting keywords. Above the event cards, are filters for the returned query for filtering by committee and date. Additionally there are options for sorting the returned results. The Seattle CDP instance website is available at: https://councildataproject.org/seattle/

CDP imposes minimal requirements as to the level of basic information that must be collected for a municipal event to be identified, accessed, and processed. At a bare minimum, for CDP infrastructures to process and store an event, the system must be given a URL to a video of the meeting, the date of the meeting, and the name of the meeting committee (e.g. “Full Council”, “Transportation Committee”, etc.). This allows CDP instances to be deployed for councils with fewer resources (e.g. school boards, neighborhood zoning boards, etc.) while still producing a standardized transcript and access mechanism to both view and download the data for further exploration and analysis. Data ingestion can be customized for each CDP deployment, but the core processing pipelines, infrastructure configuration, and web application are all shared across any city that deploys CDP. This allows for a much easier and larger collaborative effort between developers and open-source software contributors. In the following section, we detail previous text-as-data datasets that have been constructed from government sources and how CDP deployments can be utilized to compile a large corpus of municipal meeting transcripts for analysis.
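The three required fields can be pictured as a small record type. The sketch below is a hypothetical stand-in written with the standard library; the class name `MinimalEvent`, its field names, and the validation rules are assumptions made for illustration and are not the ingestion model defined by cdp-backend.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass(frozen=True)
class MinimalEvent:
    """Hypothetical record holding the three fields named above."""

    video_url: str
    meeting_datetime: datetime
    committee_name: str

    def __post_init__(self) -> None:
        # Reject values that could not identify a downloadable video
        parsed = urlparse(self.video_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a usable video URL: {self.video_url!r}")
        # Reject blank committee names
        if not self.committee_name.strip():
            raise ValueError("committee name must not be empty")


# Example values are invented for illustration
event = MinimalEvent(
    video_url="https://example.com/meetings/full-council.mp4",
    meeting_datetime=datetime(2022, 3, 29, 14, 0),
    committee_name="Full Council",
)
```

Everything else a deployment might gather (minutes, votes, attached documents) would be optional additions on top of a record like this, which is why a custom event-gathering function is the main thing a new deployment has to supply.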

Previous Municipal Meeting Datasets

Municipal meeting data is used across a number of domains of research that are interested in the institutional design and functioning of local governments - from political science and sociology to legal scholarship. In the following section we highlight a study from Einstein et al. which investigated who participates in local meetings and a study from Jacobi and Schweers which investigated how gender, ideology, and seniority affect Supreme Court oral argument. Both studies relied upon utilizing meeting records (video or transcript) for analysis of participant behaviors.

Einstein et al. provided a comprehensive look into who participates in local government. Specifically, they “[compiled and coded] new data on all citizen participants in planning and zoning board meetings dealing with the construction of multiple housing units in 97 Massachusetts cities and towns.” The researchers then matched, “thousands of individual participants to the Massachusetts voter file to explore who participates in local political meetings” [10]. This paper utilized text annotation and topic and sentiment encoding to first identify participants, and then determine what each participant did and did not support in regards to specific planning and zoning discussions. In their study, data collection and coding were done in a combined manual and automated process. Public comment coding and annotation was completed by identifying participants’ names and addresses when they spoke. Once the data had been manually collected, Einstein et al. used probabilistic name and address matching with Massachusetts voting records in order to match each participant to their voting record details and then manually verified matches.

11. Jacobi, T., and Schweers, D. (2017). Justice, interrupted: The effect of gender, ideology, and seniority at supreme court oral arguments. Va. L. Rev. 103, 1379.

In a similar study, which used mixed manual and automated methods of constructing a dataset for federal governance research, Jacobi and Schweers attempted to measure the effect of gender, ideology, and seniority at Supreme Court oral arguments. Their work processed hundreds of transcripts to search and record interruptions between the legal advocates and the Supreme Court Justices (and between Justices) [11]. Jacobi and Schweers' work was made possible by two separate databases: an existing publicly available database of Roberts Court oral arguments and a second database that was manually assembled to store in-depth analysis of interruption behaviors.

These two examples illustrate how transcript data from governance deliberations can be used to study an enormous range of consequential topics - from gendered speech patterns, to representative democratic outcomes. While these results are individually impactful, the ability to build and expand upon this research is limited because of expensive and time consuming processes required for manually collecting, processing, and structuring data for analysis. In the following sections we describe the content and structure of data made available by CDP instances and how we can make analyses of municipal governance both accessible and reproducible for research. In particular, we detail the construction of a dataset, Councils in Action, and describe how it was prepared as a corpus of machine readable transcripts ready for analysis. We then perform exploratory analysis to demonstrate the value of this corpus for municipal governance research.

The Councils in Action Dataset

This section replaces original paper text with auto-generated text to stay up-to-date with the state of the dataset.

Original Paper Text

Using Council Data Project infrastructures we assemble longitudinal data from across multiple municipal councils to ease manual curation for researchers. The proof-of-concept dataset, Councils in Action, is a corpus of over 350 meetings of the city councils of Seattle, Washington and Portland, Oregon and the county council of King County, Washington. Each meeting in our dataset includes a video file, an audio file, a transcript, and the full meeting minutes (legislative items, votes, and attached documents). Table 1 provides specific details as to the number of meetings from each municipal council and their first and last event dates.

Instance	Events	First Event	Last Event
cdp-seattle-21723dcf	256	2021-01-04	2022-03-29
cdp-king-county-b656c71b	72	2021-10-05	2022-03-30
cdp-portland-d2bbda97	32	2021-07-07	2022-03-30

Show Code for Generated Paragraph

# Markdown rendering
from IPython.display import Markdown

# Core computation
from datetime import datetime

from cdp_backend.database import models as db_models
from cdp_data import CDPInstances
from cdp_data.utils import connect_to_database
import numpy as np
import pandas as pd

# Get all instance infrastructure strings
ALL_INSTANCES = [
  getattr(CDPInstances, i)
  for i in dir(CDPInstances) if "__" not in i
]

# Get dataset size and dates
data_coverage_list = []
for instance in ALL_INSTANCES:
  connect_to_database(instance)

  # Get all events to calculate size
  events = list(db_models.Event.collection.fetch())
  num_events = len(events)

  # Only continue if the instance has data
  if num_events > 0:
    # Get earliest event datetime
    current_earliest_datetime = None
    current_latest_datetime = None
    for event in events:
      if (
        current_earliest_datetime is None
        or event.event_datetime < current_earliest_datetime
      ):
        current_earliest_datetime = event.event_datetime
      if (
        current_latest_datetime is None
        or event.event_datetime > current_latest_datetime
      ):
        current_latest_datetime = event.event_datetime
    
    # Add instance data
    data_coverage_list.append({
      "Instance": instance,
      "Events": num_events,
      "Oldest Event": current_earliest_datetime.date().isoformat(),
      "Newest Event": current_latest_datetime.date().isoformat(),
    })

# To dataframe
data_coverage = pd.DataFrame(data_coverage_list)
data_coverage = data_coverage.sort_values(by=["Events"], ascending=False)

# MARKDOWN TEMPLATE
COVERAGE_PARAGRAPH = (
  "As of `{generation_date}`, the Councils in Action dataset includes data for "
  "`{n_meetings}` meetings from across `{n_instances}` different municipal councils. "
  "@tbl-cdp-data-coverage provides specific details about the number of meetings, "
  "and the oldest and newest event dates for each municipal council."
)

# Render paragraph
Markdown(
  COVERAGE_PARAGRAPH.format(
    generation_date=datetime.utcnow().date().isoformat(),
    n_meetings=data_coverage["Events"].sum(),
    n_instances=len(data_coverage.index),
  ).strip()
)

As of 2022-12-06, the Councils in Action dataset includes data for 2586 meetings from across 17 different municipal councils. Table 1 provides specific details about the number of meetings, and the oldest and newest event dates for each municipal council.

Show Code for Rendering Table

# Import the lower-level Firestore client for anonymous, read-only access
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Import itables and make interactive
from itables import show
import itables.options as table_opts
table_opts.lengthMenu = [25, 50, 100]

# Wrap all instances with links to the website
def _wrap_infra_slug_in_website_link(infra_slug: str) -> str:
  # Connect
  db_client = Client(
    project=infra_slug,
    credentials=AnonymousCredentials(),
  )
  
  # Get metadata
  instance_metadata = db_client.document("metadata/configuration").get().to_dict()
  instance_webpage = instance_metadata["hosting_web_app_address"]

  # Wrap in an HTML anchor tag
  return f'<a href="{instance_webpage}">{infra_slug}</a>'

render_ready_data_coverage = data_coverage.copy(deep=True)
render_ready_data_coverage["Instance"] = render_ready_data_coverage["Instance"].apply(
  _wrap_infra_slug_in_website_link
)
render_ready_data_coverage = render_ready_data_coverage.set_index("Instance", drop=True)

show(render_ready_data_coverage)

[Interactive table: Instance, Events, Oldest Event, Newest Event]

Table 1: Councils in Action Data Coverage

Table 1 is interactive. Try sorting and searching.

As described in Section 3.2, each CDP instance has a website to search, discover, and link data together for a single event. To serve researchers, we further make this dataset available via Python API and ZIP archive download. We provide the cdp-data Python library specifically to access, download, cache, and analyze the Councils in Action dataset. For full documentation of all functionality available in the cdp-data library please see the provided package documentation: https://councildataproject.org/cdp-data/. For lower-level, direct database access, we provide the Python library cdp-backend. More information on lower-level access to each instance is available in each CDP deployment’s repository README (e.g. https://github.com/CouncilDataProject/seattle), and extensive documentation of the CDP database schema is available via the cdp-backend package documentation: https://councildataproject.org/cdp-backend/.

The flexibility in data collection afforded by CDP’s distributed instance deployment model allows the dataset to rapidly scale both vertically (in the number of meetings for any single council) and horizontally (as more CDP deployments are created). Therefore, as more CDP instances are created, the generated Councils in Action dataset removes barriers for research that was previously hindered by time-consuming manual data collection and analysis.

Exploratory Data Analysis

In the following section we use the Councils in Action dataset to explore and examine trends in council meetings, including public comments, over time. Our exploratory analysis focuses on keywords or N-grams. N-gram viewers have been commonly created to visualize trends in the usage of specific n-grams in large literature corpora over time [12]. Such approaches are often considered a way to ‘distantly read’ a corpus of texts [13]. Distant readings of council meetings can help understand broad trends in the way that a topic increases or decreases in importance during legislative processes. For example, if a topic decreases in frequency then, broadly, we can interpret this topic as being less important in the municipal government’s legislative agenda.
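One way to make "increases or decreases in frequency" concrete is to fit a straight line to an n-gram's usage over time and look at the sign of its slope. The sketch below computes an ordinary least-squares slope by hand using only the standard library. It is an illustrative assumption about how such a trend could be quantified, not the method used by cdp-data, and the usage values are invented.

```python
from datetime import date
from typing import List, Tuple


def usage_trend_slope(observations: List[Tuple[date, float]]) -> float:
    """Least-squares slope of percent usage per day elapsed since the first date."""
    if len(observations) < 2:
        raise ValueError("need at least two observations")
    start = min(day for day, _ in observations)
    xs = [(day - start).days for day, _ in observations]
    ys = [usage for _, usage in observations]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("observations must span more than one day")
    return numerator / denominator


# Made-up percent usage values, not taken from the dataset
observations = [
    (date(2021, 1, 4), 0.40),
    (date(2021, 1, 11), 0.30),
    (date(2021, 1, 18), 0.20),
]
slope = usage_trend_slope(observations)
```

A negative slope here would correspond to a keyword that is discussed less over the period, and a positive slope to one that is discussed more. The slope alone says nothing about why usage changed, so it is best read as a prompt for closer reading of the underlying meetings.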

12. Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Team, G.B., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science 331, 176–182.

13. Organisciak, P., Schmidt, B.M., and Downie, J.S. (2022). Giving shape to large digital libraries through exploratory data analysis. Journal of the Association for Information Science and Technology 73, 317–332.

Original Paper Text

For the Councils in Action dataset we apply the use of an n-gram visualization in order to demonstrate how topic trends evolve over time. First we use longitudinal data from the transcripts of Seattle City Council meetings to show the usage of specific n-grams as a percent of total n-grams used for each meeting during this time period. Figure 2 shows the usage of n-grams stemming from “police”, “housing”, “union”, and “homelessness” from January 1, 2021 to April 1, 2022 during meetings of the Seattle City Council.

Show Code for Pulling Original N-Gram Viewer Data

# Import keywords utilities
from cdp_data import keywords

# Computation
ORIGINAL_PAPER_PLOT_NGRAMS = [
    "police",
    "housing",
    "union",
    "homelessness",
]

original_paper_seattle_ngram_usage = keywords.compute_ngram_usage_history(
    infrastructure_slug=CDPInstances.Seattle,
    ngram_size=1,  # generate unigrams
    strict=False,  # stem grams
    start_datetime="2021-01-01",
    end_datetime="2022-04-01",
    tqdm_kws=dict(disable=True),
)

Show Code for Original N-Gram Viewer

from cdp_data import plots

grid = plots.plot_ngram_usage_histories(
    ngram=ORIGINAL_PAPER_PLOT_NGRAMS,
    gram_usage=original_paper_seattle_ngram_usage,
    strict=False,  # stem provided grams
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        hue="ngram",
        col_wrap=2,
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
    tqdm_kws=dict(disable=True),
)
grid.fig.set_size_inches(6.8, 4.857)

Figure 2: N-gram usage over time for Seattle City Council meetings from January 1, 2021 to April 1, 2022. The selected n-grams are “polic” (the stem of police, policing, etc.), “hous” (the stem of house, housing, etc.), “union”, and “homeless”. The y-axis represents the usage of each n-gram – the percent of the number of times the specific n-gram was used for the day over the total number of n-grams used during the day.
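The y-axis quantity described in the caption can be sketched in a few lines. The example below counts the n-grams in one day's tokens and expresses each as a percentage of that day's total. The function name, the choice to multiply by 100, and the token list are assumptions for illustration; the library's own computation may differ in detail.

```python
from collections import Counter
from typing import Dict, List


def percent_usage_for_day(day_tokens: List[str], n: int = 1) -> Dict[str, float]:
    """Percent of a day's n-grams accounted for by each distinct n-gram."""
    # Slide a window of size n over the tokens to form the day's n-grams
    grams = [
        " ".join(day_tokens[i : i + n])
        for i in range(len(day_tokens) - n + 1)
    ]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: 100 * count / total for gram, count in Counter(grams).items()}


# Made-up, already-stemmed tokens standing in for one day of meetings
tokens = ["polic", "budget", "hous", "polic", "union", "hous", "polic", "budget"]
usage = percent_usage_for_day(tokens)
```

Normalizing by the day's total matters because meeting lengths vary: a raw count would make long meetings look like spikes in interest even when the share of discussion stayed the same.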

Show Code for Generating Paragraph

from cdp_data import keywords
from datetime import datetime
from dateutil.relativedelta import relativedelta

INVESTIGATION_NGRAMS = [
  "police",
  "housing",
  "union",
  "homelessness",
]

# Get data
seattle_ngram_usage = keywords.compute_ngram_usage_history(
    infrastructure_slug=CDPInstances.Seattle,
    raise_on_error=False,
    tqdm_kws=dict(disable=True),
)

# Find earliest and latest datetimes
earliest_dt = pd.to_datetime(seattle_ngram_usage.session_datetime.min())
latest_dt = pd.to_datetime(seattle_ngram_usage.session_datetime.max())
earliest_day = earliest_dt.date().isoformat()
latest_day = latest_dt.date().isoformat()

# Formatted investigation grams
formatted_investigation_grams = ""
for i, gram in enumerate(INVESTIGATION_NGRAMS):
  if i + 1 == len(INVESTIGATION_NGRAMS):
    # include and
    formatted_investigation_grams += f", and `'{gram}'`"
  elif i == 0:
    # start
    formatted_investigation_grams += f"`'{gram}'`"
  else:
    # move along
    formatted_investigation_grams += f", `'{gram}'`"

# MARKDOWN TEMPLATE
SEATTLE_NGRAMS_TIMESPAN_PARAGRAPH = (
  f"For the Councils in Action dataset we apply the use of an n-gram visualization "
  f"in order to demonstrate how keyword trends evolve over time. First we use "
  f"longitudinal data from the transcripts of Seattle City Council meetings to show "
  f"the usage of specific n-grams as a percent of total n-grams used for each meeting "
  f"during this time period. @fig-seattle-ngram-viewer shows the usage of n-grams "
  f"stemming from {formatted_investigation_grams} from `{earliest_day}` to "
  f"`{latest_day}` during meetings of the Seattle City Council."
)

# Render paragraph
Markdown(SEATTLE_NGRAMS_TIMESPAN_PARAGRAPH.strip())

For the Councils in Action dataset we apply the use of an n-gram visualization in order to demonstrate how keyword trends evolve over time. First we use longitudinal data from the transcripts of Seattle City Council meetings to show the usage of specific n-grams as a percent of total n-grams used for each meeting during this time period. Figure 3 shows the usage of n-grams stemming from 'police', 'housing', 'union', and 'homelessness' from 2020-01-06 to 2022-11-28 during meetings of the Seattle City Council.

Show Code for N-Gram Viewer

from cdp_data import plots
plots.set_cdp_plotting_styles()

# Plot
grid = plots.plot_ngram_usage_histories(
    INVESTIGATION_NGRAMS,
    seattle_ngram_usage,
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        col_wrap=2,
        hue="ngram",
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
    tqdm_kws=dict(disable=True),
)
grid.fig.set_size_inches(6.8, 4.857)

Figure 3: N-gram usage over time for Seattle City Council meetings. The x-axis represents individual days. The y-axis represents the usage of each n-gram – the percent of the number of times the specific n-gram was used for the day over the total number of n-grams used during the day.

Show Code for Generating Paragraph

INVESTIGATION_STATS = {
  gram: {
    "mean_min": {
      "value": None,
      "std": None,
      "month_start_dt": None,
    },
    "mean_max": {
      "value": None,
      "std": None,
      "month_start_dt": None,
    },
  }
  for gram in INVESTIGATION_NGRAMS
}

# Convert these to months to iteratively process the data
earliest_month = datetime(earliest_dt.year, earliest_dt.month, 1)
latest_month = datetime(latest_dt.year, latest_dt.month, 1)

# Iter months and keep track of high and low months for each investigation ngram
current_month = earliest_month
while current_month < latest_month:
  # Select data
  month_end = current_month + relativedelta(months=1)
  month_data = seattle_ngram_usage.loc[
    (seattle_ngram_usage.session_datetime  >= current_month.date().isoformat())
    & (seattle_ngram_usage.session_datetime < month_end.date().isoformat())
  ]

  # Get stats
  for ngram in INVESTIGATION_NGRAMS:
    stemmed_gram = keywords._stem_n_gram(ngram)
    month_ngram_selected_data = month_data.loc[month_data.ngram == stemmed_gram]
    month_day_ngram_percent_usage = month_ngram_selected_data.day_ngram_percent_usage
    if len(month_day_ngram_percent_usage) > 0:
      mean_percent_discussion = month_day_ngram_percent_usage.mean()
      std_percent_discussion = month_day_ngram_percent_usage.std()
    else:
      mean_percent_discussion = 0.0
      std_percent_discussion = 0.0

    # Update
    ngram_stats = INVESTIGATION_STATS[ngram]
    if (
      ngram_stats["mean_min"]["value"] is None
      or ngram_stats["mean_min"]["value"] > mean_percent_discussion
    ):
      ngram_stats["mean_min"]["value"] = mean_percent_discussion
      ngram_stats["mean_min"]["std"] = std_percent_discussion
      ngram_stats["mean_min"]["month_start_dt"] = current_month
    if (
      ngram_stats["mean_max"]["value"] is None
      or ngram_stats["mean_max"]["value"] < mean_percent_discussion
    ):
      ngram_stats["mean_max"]["value"] = mean_percent_discussion
      ngram_stats["mean_max"]["std"] = std_percent_discussion
      ngram_stats["mean_max"]["month_start_dt"] = current_month
    
    INVESTIGATION_STATS[ngram] = ngram_stats

  # Update current month
  current_month = month_end

# MARKDOWN TEMPLATE
SEATTLE_NGRAMS_STATS_INTRO = (
  "To broaden our n-gram usage counting criteria to more than just our "
  "specific query grams, we stem all grams in the dataset using a "
  "[Snowball stemmer](https://www.nltk.org/api/nltk.stem.snowball.html?highlight=snowball%20stem#module-nltk.stem.snowball) "
  "to collect and plot the stemmed n-grams @nlp-python. This stemming groups "
  "related words together while keeping distinct words apart: for example, "
  "'police' and 'policing' both stem to 'polic', but 'policy' stems to "
  "'polici' @snowball-stemmer."
)

# Construct sentences for each gram
EACH_GRAM_STATS = []
for i, gram in enumerate(INVESTIGATION_STATS):
  # Construct main content
  gram_stats = INVESTIGATION_STATS[gram]
  single_ngram_stats = (
    f"percent usage of words stemming from `'{gram}'` "
    f"reached a maximum monthly average of "
    f"`{round(gram_stats['mean_max']['value'], 3)}` "
    f"± `{round(gram_stats['mean_max']['std'], 2)}` "
    f"from `{gram_stats['mean_max']['month_start_dt'].date().isoformat()}` to "
    f"`{(gram_stats['mean_max']['month_start_dt'] + relativedelta(months=1)).date().isoformat()}` "
    f"and a minimum monthly average of "
    f"`{round(gram_stats['mean_min']['value'], 3)}` "
    f"± `{round(gram_stats['mean_min']['std'], 2)}` "
    f"from `{gram_stats['mean_min']['month_start_dt'].date().isoformat()}` to "
    f"`{(gram_stats['mean_min']['month_start_dt'] + relativedelta(months=1)).date().isoformat()}`."
  )
  
  # Handle formatting
  if i + 1 == len(INVESTIGATION_STATS):
    chunk = f"Finally, the {single_ngram_stats}"
  elif i == 0:
    chunk = f"@fig-seattle-ngram-viewer shows that the {single_ngram_stats}"
  else:
    chunk = f"The {single_ngram_stats}"
  
  # Add this portion
  EACH_GRAM_STATS.append(chunk)

# Combine intro and chunks
joined_gram_stats = " ".join(EACH_GRAM_STATS)
joined_paragraph_parts = "\n\n".join([SEATTLE_NGRAMS_STATS_INTRO, joined_gram_stats])

# Render paragraph
Markdown(joined_paragraph_parts.strip())

To broaden our n-gram usage counting criteria to more than just our specific query grams, we stem all grams in the dataset using a Snowball stemmer to collect and plot the stemmed n-grams [14]. Stemming groups related word forms together while keeping distinct words apart: for example, ‘police’ and ‘policing’ both stem to ‘polic’, but ‘policy’ stems to ‘polici’ [15].

14.

Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit (O’Reilly Media, Inc.).

15.

Porter, M. (2001). A language for stemming algorithms. https://snowball.tartarus.org/texts/introduction.html.
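The stemming behavior described above can be checked directly. The following is a minimal sketch, assuming NLTK is installed and using its English `SnowballStemmer`; it is not the internal stemming routine of `cdp_data`, which may apply additional cleaning before stemming.

```python
from nltk.stem.snowball import SnowballStemmer

# English Snowball stemmer from NLTK (no corpus download is needed for this).
stemmer = SnowballStemmer("english")

# The three example words discussed in the text.
words = ["police", "policing", "policy"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Because 'police' and 'policing' share a stem while 'policy' does not, counting by stem captures related word forms without merging unrelated terms.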

Figure 3 shows that the percent usage of words stemming from 'police' reached a maximum monthly average of 1.17 ± 0.34 from 2020-06-01 to 2020-07-01 and a minimum monthly average of 0.017 ± 0.01 from 2020-04-01 to 2020-05-01. The percent usage of words stemming from 'housing' reached a maximum monthly average of 0.663 ± 0.67 from 2022-06-01 to 2022-07-01 and a minimum monthly average of 0.138 ± 0.05 from 2022-10-01 to 2022-11-01. The percent usage of words stemming from 'union' reached a maximum monthly average of 0.214 ± 0.27 from 2022-02-01 to 2022-03-01 and a minimum monthly average of 0.025 ± 0.01 from 2020-11-01 to 2020-12-01. Finally, the percent usage of words stemming from 'homelessness' reached a maximum monthly average of 0.216 ± 0.16 from 2020-02-01 to 2020-03-01 and a minimum monthly average of 0.047 ± 0.07 from 2022-10-01 to 2022-11-01.

Further, Figure 4 compares daily keyword usage across municipal councils.

Show Code for Getting Multiple Instance Data

from cdp_data import CDPInstances, keywords

# Get data
selected_munis = keywords.compute_ngram_usage_history(
    infrastructure_slug=[
      CDPInstances.Seattle,
      CDPInstances.Louisville,
      CDPInstances.Oakland,
    ],
    raise_on_error=False,
    tqdm_kws=dict(disable=True),
)

Show Code for Plotting Multiple Instance Data

from cdp_data import plots
plots.set_cdp_plotting_styles()

grid = plots.plot_ngram_usage_histories(
    ["police", "housing"],
    selected_munis,
    lmplot_kws=dict(  # extra plotting params
        col="ngram",
        row="infrastructure",
        hue="ngram",
        scatter_kws={"alpha": 0.2},
        aspect=1.6,
    ),
    tqdm_kws=dict(disable=True),
)
grid.fig.set_size_inches(6.4, 2.6 * 3)

Figure 4: N-gram usage over time from selected municipal councils. The x-axis represents individual days. The y-axis represents the usage of each n-gram – the percent of the number of times the specific n-gram was used for the day over the total number of n-grams used during the day.
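The usage measure on the y-axis can be illustrated with a small, self-contained sketch. The function names, the sample tokens, and the scaling by 100 are our assumptions for illustration; the actual computation in `cdp_data` may differ in how it tokenizes, stems, and aggregates transcripts.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram in a list of tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def day_ngram_percent_usage(
    tokens: List[str], n: int = 1
) -> Dict[Tuple[str, ...], float]:
    """Percent of the day's n-grams accounted for by each distinct n-gram."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * count / total for gram, count in counts.items()}


# Hypothetical, already-stemmed tokens from a single day of meetings.
day_tokens = ["polic", "budget", "hous", "polic", "vote", "hous", "polic", "budget"]

usage = day_ngram_percent_usage(day_tokens, n=1)
print(usage[("polic",)])  # 3 of the 8 unigrams
```

Normalizing by the day's total n-gram count keeps days with long meetings comparable to days with short ones, which is what allows usage to be compared across municipalities.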

Conclusion

In this paper, we have argued that deploying Council Data Project infrastructures to cover municipal councils not only increases access to data but also standardizes that data for easier analysis. We have demonstrated, with the proof-of-concept Councils in Action dataset, that data produced by CDP infrastructures can be easily processed and analyzed to observe shared and unique discussion trends across municipal councils. As the number of CDP instances increases, the Councils in Action dataset can support richer and more varied analyses. For example, in their comprehensive study detailing who participates in local government meetings, Einstein et al. concluded that, while there may be suggestive evidence that the trends they found hold for other states, the largest limitation of their work is that the data comes from a single state [10].

10.

Einstein, K.L., Palmer, M., and Glick, D.M. (2019). Who participates in local government? Evidence from meeting minutes. Perspectives on politics 17, 28–46.

16.

Bredin, H., Yin, R., Coria, J.M., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., and Gill, M.-P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (IEEE), pp. 7124–7128.

While the findings of such work cannot be automated, the laborious annotation process that precedes the research can be made easier with models that automatically annotate topical discussion, named entities, the linkage between discussion and legislative action, and speaker turns. Because all CDP instances share common processing pipelines, delivering new features to every instance (municipality) that CDP covers is simple. For example, to replicate a study like Jacobi and Schweers’s “Justice, Interrupted” for every municipality covered by CDP, we have already begun work on a method that fine-tunes an audio-based speaker identification transformer to label each sentence in a transcript with the known speaker’s name and uses speaker diarization to label each unknown speaker during each meeting [16].

CDP and the Councils in Action dataset could also be used to measure and automatically track the provenance and discussion of legislative action derived from “model bills” across the country [17]. A more general form of such work might measure topical and legislative diffusion across the country, for example by answering the question: “how long does it take for similar legislative actions regarding a topic to occur in multiple different municipalities?” The Councils in Action dataset also opens additional computational research questions, such as developing methods for aligning minutes items with transcripts or, more generally, models for “outline generation” that automatically generate the minutes items of a meeting from a transcript [18,19].

17.

ALEC exposed.

18.

Tardy, P., Janiszek, D., Estève, Y., and Nguyen, V. (2020). Align then summarize: Automatic alignment methods for summarization corpus creation. arXiv preprint arXiv:2007.07841.

19.

Zhang, R., Guo, J., Fan, Y., Lan, Y., and Cheng, X. (2019). Outline generation: Understanding the inherent content structure of documents. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp. 745–754.

Lastly, we emphasize that our proof-of-concept work in this paper demonstrates that research produced using the Councils in Action dataset can make its way back into the CDP instances themselves, removing barriers to municipal information for all members of the communities they serve. Council Data Project, as an open infrastructure platform, affords researchers, journalists, activists, and community members the opportunity to directly integrate their work with the data processing pipelines and/or web applications that are connected to CDP deployments. Integration efforts can directly support others working with the Councils in Action dataset, or members of the public hoping to understand the larger context of discussion, track legislative action, and hold elected officials accountable.