An exploration of data made available by Council Data Project
Author
Eva Maxfield Brown
Published
October 11, 2022
Abstract
Large-scale comparative research into municipal governance is often prohibitively difficult due to a lack of high-quality data. Recent advances in speech-to-text algorithms and natural language processing techniques have made it possible to more easily collect and analyze this type of data. In this paper, we introduce an open-source platform, the Council Data Project (CDP), to curate novel datasets for research into municipal governance. The contribution of this work is two-fold: 1. We demonstrate that CDP, as an infrastructure, can be used to assemble reliable comparative data across municipalities; 2. We provide exploratory analysis to show how CDP data can be used to gain insight into how municipal governments perform over time. We conclude by describing future directions for research on and with CDP, ranging from the development of machine learning models for speaker annotation, outline generation, and named entity recognition to the linking of data for large-scale comparative research.
Quarto
This document explains what Council Data Project (CDP) is and how the data generated and stored by CDP instances make large-scale computational social and political science studies possible at the municipal level. It is a slightly modified, up-to-date, interactive version of our ASIS&T AM22 conference paper [1].
1.
Brown, Eva Maxfield, and Weber, Nicholas (2022). Councils in action: Automating the curation of municipal governance data for research. 10.48550/ARXIV.2204.09110.
Federalism, where power devolves to states and municipalities, is a defining feature of US democracy. But federalism also poses substantial challenges to the study of state and local governments - each state and each city or county in the USA has its own rules and regulations that structure a mode of governance. For this reason, political science research into local government is often referred to as a “black hole” [2]. It can be an extended exercise in data collection to determine the identity of elected officials across the 89,000 local governments in the USA, let alone to measure and compare the performance of one system of local governance against another [3].
2.
Sapotichne, J., Jones, B.D., and Wolfe, M. (2007). Is urban politics a black hole? Analyzing the boundary between political science and urban politics. Urban Affairs Review 43, 76–106.
3.
Sumner, J.L., Farris, E.M., and Holman, M.R. (2020). Crowdsourcing reliable local data. Political Analysis 28, 244–262.
Ferreira, F., and Gyourko, J. (2014). Does gender matter for political leadership? The case of US mayors. Journal of Public Economics 112, 24–39.
Despite this challenge, there are common legal requirements for transparency in state and local government legislatures that can improve the quality, scale, and usefulness of data used to study local governance. Open meeting laws, also called sunshine laws, require that most meetings of state and local governments be open to the public, along with their decisions and records [4]. This means that, in theory, recordings of a legislative session, bills and materials supporting the proposal of a bill, and voting outcomes are required to be publicly accessible. In practice, access to recordings and voting records is often hampered by closed, proprietary information systems that are difficult to navigate and search, and from which it is hard to extract meaningful data, for citizens and researchers alike [5].
Over the last two years, the Council Data Project (CDP) has developed an open-source software platform that significantly improves access and engagement with local government data [6]. In the following paper we provide a brief background of CDP, and how this open-source platform enables large-scale, longitudinal, comparative research into municipal governance. In doing so, we position our work amongst other large-scale, public interest technology projects, and describe how CDP uniquely enables the assembly of comparative data for analysis. We then use CDP to assemble a longitudinal dataset from three different municipal councils (Seattle City Council, Portland City Council, and King County Council). We use this dataset to perform exploratory data analysis through the development of an N-gram plotting tool. Finally, we conclude by describing our own future work and the future work made possible by CDP and a growing data collection that we refer to as “Councils in Action.”
Council Data Project data now covers many more cities and counties. As this is a generated document, we will not limit ourselves to examining only Seattle City Council, Portland City Council, and King County Council. Rather, all plots and analyses will be updated each week with the latest data as we re-generate this article.
Background
Public Interest Technology for Municipal Events
A number of previous civic technology applications have created valuable and accessible local government data from publicly available information, but none have specifically focused on aggregating transcripts of legislative discussion. In the following section we review four of these applications, and note how they relate to but differ from CDP. Councilmatic was one of the first public interest technology projects to focus on making council information more accessible. Councilmatic is a system for processing and archiving past municipal council meetings and legislative information and for tracking upcoming meetings. Councilmatic is entirely open-source and there are now working examples of this application being used in cities throughout the USA, including Los Angeles, Chicago, New York City, and Oakland [7]. However, each instance of Councilmatic, at a city or county level, is set up and maintained entirely separately from the others. This distributed architecture makes it difficult to collaboratively develop new features, and prohibits cross-municipality data aggregation.
Roy, D. (2020). Local voices network: Bringing under-heard community voices, perspectives and stories forward for healthier public dialogue. In 2020 conference of the american association for applied linguistics (AAAL) (AAAL).
Local Voices Network (LVN), a project from Cortico AI, provides a powerful platform for generating, visualizing, and searching through transcripts of civic “conversations.” LVN tools are highly targeted at reaching out to communities, facilitating a small-group conversation, and making such conversations easy to digest via machine learning analysis and visualization [8]. Such facilitation of community conversations engineers a discussion about specific topics and places them into a standardized format rather than curating existing data into a standard form. LVN produces novel insight into community sentiment but at the high cost of facilitation and engineering.
Big Local News (BLN), from researchers at Stanford University, has created a platform to obtain data, through web-scraping, from municipal councils across the country. Each scraper from BLN collects meeting assets such as documents, presentations, videos, and captions. Further, there is currently no processing of the scraped data (e.g. transcript generation) in order to expand the usage of these documents to include council discussions for analysis.
Blockparty is an emergent civic technology project which generates and analyzes meeting transcripts from New York City community council meetings. Blockparty creates and processes meeting transcripts to extract keywords and other potential highlights, and then publishes both the transcript and a keyword histogram to the web. Blockparty currently only serves New York City and without open-source code we are limited in understanding their deployment and data access mechanisms. Specifically, we do not know how researchers and civic hackers alike might deploy Blockparty for their own municipality, and more importantly, there is no structured open data produced by this project that can be analyzed for research purposes.
Lastly, while many projects are focused on the collection and republishing of municipal meeting data (from documents, videos, and in some cases transcripts), it is rare for a project to address the data collection, aggregation, transformation, and linkage of data together. For example, Councilmatic is the only project from prior examples which additionally stores legislative outcomes and voting records. The storage of such data is a valuable attachment which allows for investigation not only from discussion but additionally against the legislative end result.
Council Data Project
Council Data Project (CDP) attempts to improve upon state-of-the-art public interest technology projects by providing a low-cost, flexible, and scalable open-source solution for generating a large standardized corpus of municipal council meetings. CDP can be deployed in any municipality with minimal configuration by a developer or city IT department. Using a simple Python-based ‘cookie-cutter template’ [9], a developer can configure a new CDP deployment with just a two-line installation process and fully deploy the instance once a function to gather events has been provided [6]. Once fully deployed, an instance of CDP will collect and process municipal meeting minutes, agendas, voting records, and, crucially, the recorded video or audio of each meeting that is archived by a municipality. For every legislative meeting that CDP processes, the system generates a transcript from the provided video using a machine learning model that converts recorded audio to text (i.e. speech-to-text processing). On a continuous schedule, CDP infrastructure then uses this corpus of transcripts to generate and update a keyword-based index that enables search of meetings by keyword [6].
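To make the idea of a keyword-based index concrete, the following is a minimal sketch of an inverted index over transcripts. It is not CDP's actual indexing pipeline: the event ids, the transcript text, and the function names (`tokenize`, `build_keyword_index`, `search`) are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import re
from collections import defaultdict

# Hypothetical transcripts keyed by made-up event ids (illustrative data only).
TRANSCRIPTS = {
    "event-001": "The council discussed housing and police funding.",
    "event-002": "Public comment focused on housing near transit.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_keyword_index(transcripts: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each keyword to the events that use it and how often they use it."""
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for event_id, text in transcripts.items():
        for token in tokenize(text):
            index[token][event_id] = index[token].get(event_id, 0) + 1
    return dict(index)


def search(index: dict[str, dict[str, int]], keyword: str) -> list[str]:
    """Return matching event ids, most frequent usage first, ties broken by id."""
    matches = index.get(keyword.lower(), {})
    return sorted(matches, key=lambda event_id: (-matches[event_id], event_id))


index = build_keyword_index(TRANSCRIPTS)
print(search(index, "housing"))
print(search(index, "police"))
```

Rebuilding such an index on a schedule, as new transcripts arrive, is what keeps keyword search current; a production index would likely also stem tokens and weight them, which this sketch omits.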
Brown, E.M., Huynh, T., Na, I., Ledbetter, B., Ticehurst, H., Liu, S., Gilles, E., Cho, S., Ragoler, S., Weber, N., et al. (2021). Council data project: Software for municipal data collection, analysis, and publication. Journal of Open Source Software 6, 3904.
To make CDP data more accessible to the public, and to provide an interface for building a deeper, contextual understanding, each CDP infrastructure also publishes a website (see Figure 1). Users of CDP websites can search for municipal meetings using the keyword index and then retrieve the video of a meeting, the sentence time-stamped transcript (for reading and for jumping playback of video to a sentence start point), access the full minutes of the meeting, and view votes that took place during a meeting. Further, because CDP transforms municipal government data into a database specification, CDP also aggregates and publishes aggregate data - such as the entire voting records of city council members, or document timestamps that show when specific actions were taken on a piece of municipal legislation.
CDP imposes minimal requirements as to the level of basic information that must be collected for a municipal event to be identified, accessed, and processed. At a bare minimum, for CDP infrastructures to process and store an event, the system must be given a URL to a video of the meeting, the date of the meeting, and the name of the meeting committee (e.g. “Full Council”, “Transportation Committee”). This allows CDP instances to be deployed for councils with fewer resources (e.g. school boards, neighborhood zoning boards) while still producing a standardized transcript and access mechanism to both view and download the data for further exploration and analysis. Data ingestion can be customized for each CDP deployment, but the core processing pipelines, infrastructure configuration, and web application are all shared across any city that deploys CDP. This allows for a much easier and larger collaborative effort between developers and open-source software contributors. In the following section, we detail previous text-as-data datasets that have been constructed from government sources and how CDP deployments can be utilized to compile a large corpus of municipal meeting transcripts for analysis.
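The three minimum fields described above (video URL, meeting date, committee name) can be sketched as a small validated record. This is an illustration only: `MinimalEvent` and its field names are hypothetical stand-ins, not the ingestion model that the cdp-backend library actually defines, and the example URL is a placeholder.

```python
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse


@dataclass(frozen=True)
class MinimalEvent:
    """Hypothetical stand-in for the minimum data needed to process an event."""

    video_url: str
    meeting_datetime: datetime
    committee_name: str

    def __post_init__(self) -> None:
        # Require an absolute http(s) URL so the video can be fetched later.
        parsed = urlparse(self.video_url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a usable video URL: {self.video_url!r}")
        # Require a non-blank committee name such as "Full Council".
        if not self.committee_name.strip():
            raise ValueError("committee name must not be empty")


# Placeholder values for illustration only.
event = MinimalEvent(
    video_url="https://example.com/meetings/2021-01-04.mp4",
    meeting_datetime=datetime(2021, 1, 4, 14, 0),
    committee_name="Full Council",
)
print(event)
```

Keeping the required fields this small is what lets a deployment target bodies that publish little more than a video archive; minutes, votes, and documents can then be treated as optional additions.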
Previous Municipal Meeting Datasets
Municipal meeting data is used across a number of domains of research that are interested in the institutional design and functioning of local governments - from political science and sociology to legal scholarship. In the following section we highlight a study from Einstein et al. which investigated who participates in local meetings and a study from Jacobi and Schweers which investigated how gender, ideology, and seniority affect Supreme Court oral argument. Both studies relied upon utilizing meeting records (video or transcript) for analysis of participant behaviors.
Einstein et al. provided a comprehensive look into who participates in local government. Specifically, they “[compiled and coded] new data on all citizen participants in planning and zoning board meetings dealing with the construction of multiple housing units in 97 Massachusetts cities and towns.” The researchers then matched “thousands of individual participants to the Massachusetts voter file to explore who participates in local political meetings” [10]. This paper utilized text annotation and topic and sentiment encoding to first identify participants, and then determine what each participant did and did not support with regard to specific planning and zoning discussions. In their study, data collection and coding were done in a combined manual and automated process. Public comment coding and annotation was completed by identifying participants’ names and addresses when they spoke. Once the data had been manually collected, Einstein et al. used probabilistic name and address matching against Massachusetts voting records to match each participant to their voting record details, and then manually verified the matches.
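As a rough illustration of what fuzzy name and address matching involves, the sketch below scores candidate voter records with Python's standard-library `difflib.SequenceMatcher`. This is not the method Einstein et al. used; the records, the equal weighting of name and address, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical voter-file rows (fabricated for illustration only).
VOTER_FILE = [
    {"name": "Maria Gonzalez", "address": "12 Elm Street"},
    {"name": "Marian Gonzales", "address": "98 Oak Avenue"},
    {"name": "John Smith", "address": "12 Elm Street"},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, address, voters, threshold=0.8):
    """Return the best-scoring voter row, or None if no row clears the threshold.

    The score is an assumed equal-weight average of name and address similarity.
    """
    scored = [
        (0.5 * similarity(name, v["name"]) + 0.5 * similarity(address, v["address"]), v)
        for v in voters
    ]
    score, voter = max(scored, key=lambda pair: pair[0])
    return voter if score >= threshold else None


# A speaker's name and address as they might be transcribed from public comment.
match = best_match("Maria Gonzales", "12 Elm St", VOTER_FILE)
print(match)
```

Any automated match of this kind would still need manual verification, as in the original study, because near-identical names at different addresses are easy to confuse.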
11.
Jacobi, T., and Schweers, D. (2017). Justice, interrupted: The effect of gender, ideology, and seniority at supreme court oral arguments. Va. L. Rev. 103, 1379.
In a similar study, which used mixed manual and automated methods of constructing a dataset for federal governance research, Jacobi and Schweers attempted to measure the effect of gender, ideology, and seniority at Supreme Court oral arguments. Their work processed hundreds of transcripts to search for and record interruptions between the legal advocates and the Supreme Court Justices (and between Justices) [11]. Jacobi and Schweers’s work was made possible by two separate databases: an existing publicly available database of Roberts Court oral arguments and a second database that was manually assembled to store in-depth analysis of interruption behaviors.
These two examples illustrate how transcript data from governance deliberations can be used to study an enormous range of consequential topics - from gendered speech patterns to representative democratic outcomes. While these results are individually impactful, the ability to build and expand upon this research is limited because of the expensive and time-consuming processes required for manually collecting, processing, and structuring data for analysis. In the following sections we describe the content and structure of data made available by CDP instances and how we can make analyses of municipal governance both accessible and reproducible for research. In particular, we detail the construction of a dataset, Councils in Action, and describe how it was prepared as a corpus of machine-readable transcripts ready for analysis. We then perform exploratory analysis to demonstrate the value of this corpus for municipal governance research.
The Councils in Action Dataset
This section replaces original paper text with auto-generated text to stay up-to-date with the state of the dataset.
Original Paper Text
Using Council Data Project infrastructures, we assemble longitudinal data from across multiple municipal councils to ease manual curation for researchers. The proof-of-concept dataset, Councils in Action, is a corpus of over 350 meetings of the city councils of Seattle, Washington and Portland, Oregon and the county council of King County, Washington. Each meeting in our dataset includes a video file, an audio file, a transcript, and the full meeting minutes (legislative items, votes, and attached documents). Table 1 provides specific details as to the number of meetings from each municipal council and their first and last event dates.
| Instance | Events | First Event | Last Event |
|---|---|---|---|
| cdp-seattle-21723dcf | 256 | 2021-01-04 | 2022-03-29 |
| cdp-king-county-b656c71b | 72 | 2021-10-05 | 2022-03-30 |
| cdp-portland-d2bbda97 | 32 | 2021-07-07 | 2022-03-30 |
Show Code for Generated Paragraph
    # Markdown rendering
    from IPython.display import Markdown

    # Core computation
    from datetime import datetime
    from cdp_backend.database import models as db_models
    from cdp_data import CDPInstances
    from cdp_data.utils import connect_to_database
    import numpy as np
    import pandas as pd

    # Get all instance infrastructure strings
    ALL_INSTANCES = [
        getattr(CDPInstances, i)
        for i in dir(CDPInstances) if "__" not in i
    ]

    # Get dataset size and dates
    data_coverage_list = []
    for instance in ALL_INSTANCES:
        connect_to_database(instance)

        # Get all events to calculate size
        events = list(db_models.Event.collection.fetch())
        num_events = len(events)

        # Only continue if the instance has data
        if num_events > 0:
            # Get earliest and latest event datetimes
            current_earliest_datetime = None
            current_latest_datetime = None
            for event in events:
                if (
                    current_earliest_datetime is None
                    or event.event_datetime < current_earliest_datetime
                ):
                    current_earliest_datetime = event.event_datetime
                if (
                    current_latest_datetime is None
                    or event.event_datetime > current_latest_datetime
                ):
                    current_latest_datetime = event.event_datetime

            # Add instance data
            data_coverage_list.append({
                "Instance": instance,
                "Events": num_events,
                "Oldest Event": current_earliest_datetime.date().isoformat(),
                "Newest Event": current_latest_datetime.date().isoformat(),
            })

    # To dataframe
    data_coverage = pd.DataFrame(data_coverage_list)
    data_coverage = data_coverage.sort_values(by=["Events"], ascending=False)

    # MARKDOWN TEMPLATE
    COVERAGE_PARAGRAPH = (
        "As of `{generation_date}`, the Councils in Action dataset includes data for "
        "`{n_meetings}` meetings from across `{n_instances}` different municipal councils. "
        "@tbl-cdp-data-coverage provides specific details about the number of meetings, "
        "and the oldest and newest event dates for each municipal council."
    )

    # Render paragraph
    Markdown(
        COVERAGE_PARAGRAPH.format(
            generation_date=datetime.utcnow().date().isoformat(),
            n_meetings=data_coverage["Events"].sum(),
            n_instances=len(data_coverage.index),
        ).strip()
    )
As of 2022-12-06, the Councils in Action dataset includes data for 2586 meetings from across 17 different municipal councils. Table 1 provides specific details about the number of meetings, and the oldest and newest event dates for each municipal council.
Show Code for Rendering Table
    # Import even more base firestore client
    from google.auth.credentials import AnonymousCredentials
    from google.cloud.firestore import Client

    # Import itables and make interactive
    from itables import show
    import itables.options as table_opts

    table_opts.lengthMenu = [25, 50, 100]

    # Wrap all instances with links to the website
    def _wrap_infra_slug_in_website_link(infra_slug: str) -> str:
        # Connect
        db_client = Client(
            project=infra_slug,
            credentials=AnonymousCredentials(),
        )

        # Get metadata
        instance_metadata = db_client.document("metadata/configuration").get().to_dict()
        instance_webpage = instance_metadata["hosting_web_app_address"]

        # Wrap in markdown
        return f'<a href="{instance_webpage}">{infra_slug}</a>'

    render_ready_data_coverage = data_coverage.copy(deep=True)
    render_ready_data_coverage["Instance"] = render_ready_data_coverage["Instance"].apply(
        _wrap_infra_slug_in_website_link
    )
    render_ready_data_coverage = render_ready_data_coverage.set_index("Instance", drop=True)
    show(render_ready_data_coverage)
Table 1 is interactive. Try sorting and searching.
As described in Section 3.2, each CDP instance has a website to search, discover, and link data together for a single event. To serve researchers, we further make this dataset available via a Python API and ZIP archive download. We provide the cdp-data Python library specifically to access, download, cache, and analyze the Councils in Action dataset. For full documentation of all functionality available in the cdp-data library, please see the provided package documentation: https://councildataproject.org/cdp-data/. For lower-level, direct database access, we provide the Python library cdp-backend. More information on lower-level access to each instance is available in each CDP deployment’s repository README (e.g. https://github.com/CouncilDataProject/seattle), and extensive documentation of the CDP database schema is available via the cdp-backend package documentation: https://councildataproject.org/cdp-backend/.
The flexibility in data collection afforded by CDP’s distributed instance deployment model allows the dataset to rapidly scale both vertically (in the number of meetings for any single council) and horizontally (as more CDP deployments are created). Therefore, as more CDP instances are created, the generated Councils in Action dataset removes a barrier that has previously hindered research: time-consuming manual data collection and analysis.
Exploratory Data Analysis
In the following section we use the Councils in Action dataset to explore and examine trends in council meetings, including public comments, over time. Our exploratory analysis focuses on keywords or N-grams. N-gram viewers have been commonly created to visualize trends in the usage of specific n-grams in large literature corpora over time [12]. Such approaches are often considered a way to ‘distantly read’ a corpus of texts [13]. Distant readings of council meetings can help understand broad trends in the way that a topic increases or decreases in importance during legislative processes. For example, if a topic decreases in frequency then, broadly, we can interpret this topic as being less important in the municipal government’s legislative agenda.
12.
Michel, J.-B., Shen, Y.K., Aiden, A.P., Veres, A., Gray, M.K., Team, G.B., Pickett, J.P., Hoiberg, D., Clancy, D., Norvig, P., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science 331, 176–182.
13.
Organisciak, P., Schmidt, B.M., and Downie, J.S. (2022). Giving shape to large digital libraries through exploratory data analysis. Journal of the Association for Information Science and Technology 73, 317–332.
Original Paper Text
For the Councils in Action dataset we apply the use of an n-gram visualization in order to demonstrate how topic trends evolve over time. First we use longitudinal data from the transcripts of Seattle City Council meetings to show the usage of specific n-grams as a percent of total n-grams used for each meeting during this time period. Figure 2 shows the usage of n-grams stemming from “police”, “housing”, “union”, and “homelessness” from January 1, 2021 to April 1, 2022 during meetings of the Seattle City Council.
    from cdp_data import keywords
    from datetime import datetime
    from dateutil.relativedelta import relativedelta

    INVESTIGATION_NGRAMS = [
        "police",
        "housing",
        "union",
        "homelessness",
    ]

    # Get data
    seattle_ngram_usage = keywords.compute_ngram_usage_history(
        infrastructure_slug=CDPInstances.Seattle,
        raise_on_error=False,
        tqdm_kws=dict(disable=True),
    )

    # Find earliest and latest datetimes
    earliest_dt = pd.to_datetime(seattle_ngram_usage.session_datetime.min())
    latest_dt = pd.to_datetime(seattle_ngram_usage.session_datetime.max())
    earliest_day = earliest_dt.date().isoformat()
    latest_day = latest_dt.date().isoformat()

    # Formatted investigation grams
    formatted_investigation_grams = ""
    for i, gram in enumerate(INVESTIGATION_NGRAMS):
        if i + 1 == len(INVESTIGATION_NGRAMS):
            # include and
            formatted_investigation_grams += f", and `'{gram}'`"
        elif i == 0:
            # start
            formatted_investigation_grams += f"`'{gram}'`"
        else:
            # move along
            formatted_investigation_grams += f", `'{gram}'`"

    # MARKDOWN TEMPLATE
    SEATTLE_NGRAMS_TIMESPAN_PARAGRAPH = (
        f"For the Councils in Action dataset we apply the use of an n-gram visualization "
        f"in order to demonstrate how keyword trends evolve over time. First we use "
        f"longitudinal data from the transcripts of Seattle City Council meetings to show "
        f"the usage of specific n-grams as a percent of total n-grams used for each meeting "
        f"during this time period. @fig-seattle-ngram-viewer shows the usage of n-grams "
        f"stemming from {formatted_investigation_grams} from `{earliest_day}` to "
        f"`{latest_day}` during meetings of the Seattle City Council."
    )

    # Render paragraph
    Markdown(SEATTLE_NGRAMS_TIMESPAN_PARAGRAPH.strip())
For the Councils in Action dataset we apply the use of an n-gram visualization in order to demonstrate how keyword trends evolve over time. First we use longitudinal data from the transcripts of Seattle City Council meetings to show the usage of specific n-grams as a percent of total n-grams used for each meeting during this time period. Figure 3 shows the usage of n-grams stemming from 'police', 'housing', 'union', and 'homelessness' from 2020-01-06 to 2022-11-28 during meetings of the Seattle City Council.
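The quantity plotted here, usage of an n-gram as a percent of all n-grams in a meeting, can be illustrated with a small self-contained sketch. It assumes unigrams only and exact matching without stemming, so it is a simplification of what the cdp-data library computes; the meeting text and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical single-meeting transcript (10 tokens, illustrative only).
MEETING_TEXT = (
    "Housing policy and police budget: council members discussed housing levy."
)


def unigrams(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def percent_usage(text, query_grams):
    """Return each query gram's share of all unigrams in the text, in percent."""
    tokens = unigrams(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {
        gram: (100.0 * counts[gram] / total if total else 0.0)
        for gram in query_grams
    }


usage = percent_usage(MEETING_TEXT, ["police", "housing", "union"])
print(usage)
```

Note that the exact-match count keeps 'policy' separate from 'police' here; stemming is what lets the actual analysis also fold word forms such as 'policing' into the same count.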
    INVESTIGATION_STATS = {
        gram: {
            "mean_min": {
                "value": None,
                "std": None,
                "month_start_dt": None,
            },
            "mean_max": {
                "value": None,
                "std": None,
                "month_start_dt": None,
            },
        }
        for gram in INVESTIGATION_NGRAMS
    }

    # Convert these to months to iteratively process the data
    earliest_month = datetime(earliest_dt.year, earliest_dt.month, 1)
    latest_month = datetime(latest_dt.year, latest_dt.month, 1)

    # Iter months and keep track of high and low months for each investigation ngram
    current_month = earliest_month
    while current_month < latest_month:
        # Select data
        month_end = current_month + relativedelta(months=1)
        month_data = seattle_ngram_usage.loc[
            (seattle_ngram_usage.session_datetime >= current_month.date().isoformat())
            & (seattle_ngram_usage.session_datetime < month_end.date().isoformat())
        ]

        # Get stats
        for ngram in INVESTIGATION_NGRAMS:
            stemmed_gram = keywords._stem_n_gram(ngram)
            month_ngram_selected_data = month_data.loc[month_data.ngram == stemmed_gram]
            month_day_ngram_percent_usage = month_ngram_selected_data.day_ngram_percent_usage
            if len(month_day_ngram_percent_usage) > 0:
                mean_percent_discussion = month_day_ngram_percent_usage.mean()
                std_percent_discussion = month_day_ngram_percent_usage.std()
            else:
                mean_percent_discussion = 0.0
                std_percent_discussion = 0.0

            # Update
            ngram_stats = INVESTIGATION_STATS[ngram]
            if (
                ngram_stats["mean_min"]["value"] is None
                or ngram_stats["mean_min"]["value"] > mean_percent_discussion
            ):
                ngram_stats["mean_min"]["value"] = mean_percent_discussion
                ngram_stats["mean_min"]["std"] = std_percent_discussion
                ngram_stats["mean_min"]["month_start_dt"] = current_month
            if (
                ngram_stats["mean_max"]["value"] is None
                or ngram_stats["mean_max"]["value"] < mean_percent_discussion
            ):
                ngram_stats["mean_max"]["value"] = mean_percent_discussion
                ngram_stats["mean_max"]["std"] = std_percent_discussion
                ngram_stats["mean_max"]["month_start_dt"] = current_month
            INVESTIGATION_STATS[ngram] = ngram_stats

        # Update current month
        current_month = month_end

    # MARKDOWN TEMPLATE
    SEATTLE_NGRAMS_STATS_INTRO = (
        "To broaden our n-gram usage counting criteria to more than just our "
        "specific query grams, we stem all grams in the dataset using a "
        "[Snowball stemmer](https://www.nltk.org/api/nltk.stem.snowball.html?highlight=snowball%20stem#module-nltk.stem.snowball) "
        "to collect and plot the stemmed n-grams @nlp-python. This stemming groups "
        "related word forms while keeping distinct words apart: for example, 'police' and 'policing' both "
        "stem to 'polic' but 'policy' stems to 'polici' @snowball-stemmer."
    )

    # Construct sentences for each gram
    EACH_GRAM_STATS = []
    for i, gram in enumerate(INVESTIGATION_STATS):
        # Construct main content
        gram_stats = INVESTIGATION_STATS[gram]
        single_ngram_stats = (
            f"percent usage of words stemming from `'{gram}'` "
            f"reached a maximum monthly average of "
            f"`{round(gram_stats['mean_max']['value'], 3)}` "
            f"± `{round(gram_stats['mean_max']['std'], 2)}` "
            f"from `{gram_stats['mean_max']['month_start_dt'].date().isoformat()}` to "
            f"`{(gram_stats['mean_max']['month_start_dt'] + relativedelta(months=1)).date().isoformat()}` "
            f"and a minimum monthly average of "
            f"`{round(gram_stats['mean_min']['value'], 3)}` "
            f"± `{round(gram_stats['mean_min']['std'], 2)}` "
            f"from `{gram_stats['mean_min']['month_start_dt'].date().isoformat()}` to "
            f"`{(gram_stats['mean_min']['month_start_dt'] + relativedelta(months=1)).date().isoformat()}`."
        )

        # Handle formatting
        if i + 1 == len(INVESTIGATION_STATS):
            chunk = f"Finally, the {single_ngram_stats}"
        elif i == 0:
            chunk = f"@fig-seattle-ngram-viewer shows that the {single_ngram_stats}"
        else:
            chunk = f"The {single_ngram_stats}"

        # Add this portion
        EACH_GRAM_STATS.append(chunk)

    # Combine intro and chunks
    joined_gram_stats = " ".join(EACH_GRAM_STATS)
    joined_paragraph_parts = "\n\n".join([SEATTLE_NGRAMS_STATS_INTRO, joined_gram_stats])

    # Render paragraph
    Markdown(joined_paragraph_parts.strip())
To broaden our n-gram usage counting criteria to more than just our specific query grams, we stem all grams in the dataset using a Snowball stemmer to collect and plot the stemmed n-grams [14]. This stemming groups related word forms while keeping distinct words apart: for example, ‘police’ and ‘policing’ both stem to ‘polic’ but ‘policy’ stems to ‘polici’ [15].
14.
Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with python: Analyzing text with the natural language toolkit (O’Reilly Media, Inc.).
15.
Porter, M. (2001). A language for stemming algorithms. https://snowball.tartarus.org/texts/introduction.html.
Figure 3 shows that the percent usage of words stemming from 'police' reached a maximum monthly average of 1.17 ± 0.34 from 2020-06-01 to 2020-07-01 and a minimum monthly average of 0.017 ± 0.01 from 2020-04-01 to 2020-05-01. The percent usage of words stemming from 'housing' reached a maximum monthly average of 0.663 ± 0.67 from 2022-06-01 to 2022-07-01 and a minimum monthly average of 0.138 ± 0.05 from 2022-10-01 to 2022-11-01. The percent usage of words stemming from 'union' reached a maximum monthly average of 0.214 ± 0.27 from 2022-02-01 to 2022-03-01 and a minimum monthly average of 0.025 ± 0.01 from 2020-11-01 to 2020-12-01. Finally, the percent usage of words stemming from 'homelessness' reached a maximum monthly average of 0.216 ± 0.16 from 2020-02-01 to 2020-03-01 and a minimum monthly average of 0.047 ± 0.07 from 2022-10-01 to 2022-11-01.
Further, Figure 4 compares daily keyword usage across municipal councils.
Show Code for Getting Multiple Instance Data
    from cdp_data import CDPInstances, keywords

    # Get data
    selected_munis = keywords.compute_ngram_usage_history(
        infrastructure_slug=[
            CDPInstances.Seattle,
            CDPInstances.Louisville,
            CDPInstances.Oakland,
        ],
        raise_on_error=False,
        tqdm_kws=dict(disable=True),
    )
In this paper, we have argued that deploying Council Data Project infrastructures to cover municipal councils is a solution not only for increasing access to data but also for standardizing this data for easier analysis. We have demonstrated, with the proof-of-concept Councils in Action dataset, that data produced by CDP infrastructures can be easily processed and analyzed to observe shared and unique discussion trends across municipal councils. As the number of CDP instances increases, the Councils in Action dataset can be used for even richer and more varied analyses. For example, in their comprehensive study detailing who participates in local government meetings, Einstein et al. concluded that while there may be suggestive evidence that the trends they found hold for other states, the largest limitation of their work is that the data comes from a single state [10].
10.
Einstein, K.L., Palmer, M., and Glick, D.M. (2019). Who participates in local government? Evidence from meeting minutes. Perspectives on politics 17, 28–46.
16.
Bredin, H., Yin, R., Coria, J.M., Gelly, G., Korshunov, P., Lavechin, M., Fustes, D., Titeux, H., Bouaziz, W., and Gill, M.-P. (2020). pyannote.audio: neural building blocks for speaker diarization. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) (IEEE), pp. 7124–7128.
While the findings of such work cannot be automated, the laborious annotation process required before research can begin can be made easier with models that automatically annotate topical discussion, named entities, the linkage between discussion and legislative action, and speaker turns. Because all CDP instances share common processing pipelines, delivering new features to each instance (municipality) that CDP covers is simple. For example, to replicate a study like Jacobi and Schweers's “Justice, Interrupted” for every municipality covered by CDP, we have already begun work on a method that fine-tunes an audio-based speaker identification transformer to label each sentence in a transcript with the known speaker’s name and uses speaker diarization to label each unknown speaker during each meeting [16].
CDP and the Councils in Action dataset can also potentially be used to measure and automatically track the provenance of, and discussion around, legislative action derived from “model bills” across the country [17]. A more general form of such work might measure topical and legislative diffusion across the country, for example by answering the question: “how long does it take for similar legislative actions regarding a topic to occur in multiple different municipalities?” The Councils in Action dataset also opens additional computational research questions, such as the research and development of methods for aligning minutes items with transcripts or, more generally, models for “outline generation” that automatically generate the minutes items of a meeting from a transcript [18,19].
18.
Tardy, P., Janiszek, D., Estève, Y., and Nguyen, V. (2020). Align then summarize: Automatic alignment methods for summarization corpus creation. arXiv preprint arXiv:2007.07841.
19.
Zhang, R., Guo, J., Fan, Y., Lan, Y., and Cheng, X. (2019). Outline generation: Understanding the inherent content structure of documents. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pp. 745–754.
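The diffusion question posed above ("how long does it take for similar legislative actions regarding a topic to occur in multiple different municipalities?") reduces, in its simplest form, to comparing first-occurrence dates. The sketch below assumes that the hard part, deciding which actions count as similar, has already been done. The dates are invented for illustration and `diffusion_lags` is a hypothetical helper, not part of any CDP library.

```python
from datetime import date

# Hypothetical first dates on which a similar legislative action
# appeared in each municipality (made-up values, not real CDP data).
first_action_dates = {
    "seattle": date(2021, 3, 1),
    "oakland": date(2021, 4, 15),
    "louisville": date(2021, 9, 1),
}


def diffusion_lags(first_dates: dict[str, date]) -> dict[str, int]:
    """Days between the earliest action anywhere and each municipality's first action."""
    origin = min(first_dates.values())
    return {name: (when - origin).days for name, when in first_dates.items()}


lags = diffusion_lags(first_action_dates)
print(lags)
```

A fuller analysis would need a defensible similarity measure between legislative items and would have to account for councils whose CDP coverage begins after the action first occurred.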
Lastly, we emphasize that our proof-of-concept work in this paper demonstrates that research produced using the Councils in Action dataset can make its way back into the CDP instances themselves, removing barriers to municipal information for all members of the communities they serve. Council Data Project, as an open infrastructure platform, affords researchers, journalists, activists, and community members the opportunity to directly integrate their work with the data processing pipelines and/or web applications that are connected to CDP deployments. Integration efforts can directly support others working with the Councils in Action dataset, or members of the public hoping to understand the larger context of discussion, track legislative action, and hold elected officials accountable.
We have demonstrated, with the proof-of-concept Councils in Action dataset, that data produced by CDP infrastructures can be easily processed and analyzed to observe shared and unique discussion trends across municipal councils.

As the number of CDP instances increases, the Councils in Action dataset can be used for richer and more varied analyses. For example, in their comprehensive study detailing who participates in local government meetings, Einstein et al. concluded that while there may be suggestive evidence that the trends they found hold for other states, the largest limitation of their work is that the data comes from a single state @einstein-local-gov-participates.

While the findings of such work cannot be automated, the laborious annotation process that precedes such research can be made easier with models that automatically annotate topical discussion, named entities, the linkage between discussion and legislative action, and speaker turns. Because all CDP instances share common processing pipelines, delivering new features to each instance (municipality) that CDP covers is simple. For example, to replicate a study like Jacobi and Schweers' "Justice, Interrupted", but for every municipality covered by CDP, we have already begun to work on a [method for fine-tuning an audio-based speaker identification transformer](https://github.com/councildataproject/speakerbox) to label each sentence in a transcript with the known speaker’s name, using speaker diarization to label each of the unknown speakers in each meeting @pyannote.

CDP and the Councils in Action dataset can also potentially be used to measure and automatically track the provenance and discussion of legislative action stemming from "model bills" across the country @alec-exposed.
A more general form of such work might look to measure topical and legislative diffusion across the country, for example by answering the question: "how long does it take for similar legislative actions regarding a topic to occur in multiple different municipalities?" There are additional computational research questions available for investigation with the Councils in Action dataset, such as the research and development of methods for aligning minutes items with transcripts or, more generally, models for "outline generation" that automatically generate the minutes items of a meeting from a transcript [@align-then-summarize;@outline-generation].

Lastly, we emphasize that our proof-of-concept work in this paper demonstrates that research produced using the Councils in Action dataset can make its way back into the CDP instances themselves, removing barriers to municipal information for all members of the communities they serve. Council Data Project, as an open infrastructure platform, affords researchers, journalists, activists, and community members the opportunity to directly integrate their work with the data processing pipelines and/or web applications that are connected to CDP deployments. Integration efforts can directly support others working with the Councils in Action dataset, or members of the public hoping to understand the larger context of discussion, track legislative action, and hold elected officials accountable.
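
As a rough illustration of the diffusion question raised above, the sketch below compares the first day on which a stemmed n-gram's usage crosses a threshold in each municipality. It measures topical diffusion only, as a proxy, and says nothing about legislative action. The record keys mirror column names used in the n-gram usage tables earlier in this document; the function names `first_crossing` and `diffusion_lags`, the threshold, the assumption that `session_datetime` holds `datetime` objects, and all sample values are hypothetical and chosen purely for illustration.

```python
from datetime import datetime

# Hypothetical records shaped like rows of the n-gram usage tables used earlier.
# All values are made up for illustration; they are not CDP data.
records = [
    {"infrastructure": "seattle", "ngram": "polic",
     "session_datetime": datetime(2020, 5, 18), "day_ngram_percent_usage": 0.2},
    {"infrastructure": "seattle", "ngram": "polic",
     "session_datetime": datetime(2020, 6, 1), "day_ngram_percent_usage": 0.9},
    {"infrastructure": "seattle", "ngram": "polic",
     "session_datetime": datetime(2020, 6, 8), "day_ngram_percent_usage": 1.1},
    {"infrastructure": "oakland", "ngram": "polic",
     "session_datetime": datetime(2020, 6, 15), "day_ngram_percent_usage": 0.7},
    {"infrastructure": "louisville", "ngram": "polic",
     "session_datetime": datetime(2020, 6, 2), "day_ngram_percent_usage": 0.1},
    {"infrastructure": "seattle", "ngram": "hous",
     "session_datetime": datetime(2020, 5, 1), "day_ngram_percent_usage": 2.0},
]


def first_crossing(rows, ngram, threshold):
    """Map each municipality to the first day `ngram` usage met `threshold`."""
    first = {}
    for row in rows:
        if row["ngram"] != ngram or row["day_ngram_percent_usage"] < threshold:
            continue
        muni = row["infrastructure"]
        day = row["session_datetime"]
        # Keep the earliest qualifying day per municipality
        if muni not in first or day < first[muni]:
            first[muni] = day
    return first


def diffusion_lags(first_dates):
    """Days between the earliest crossing anywhere and each municipality's crossing."""
    if not first_dates:
        return {}
    origin = min(first_dates.values())
    return {muni: (day - origin).days for muni, day in first_dates.items()}


lags = diffusion_lags(first_crossing(records, "polic", threshold=0.5))
print(lags)
```

Municipalities that never reach the threshold are simply absent from the result. A real analysis would need a more principled definition of a "similar" discussion or action than a fixed usage threshold.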