Online comment floods during public consultations have posed unique governance challenges for regulatory bodies seeking relevant information on proposed regulations. How should regulatory bodies separate spam and fake comments from genuine submissions by the public, especially when fake comments are designed to imitate ordinary citizens? How can regulatory bodies achieve both breadth and depth in their citations to the comment corpus? What is the best way to select comments that represent the average submission and comments that supply highly specialized information?
We present the comment corpus from the Federal Communications Commission’s (FCC) 2017 “Restoring Internet Freedom” proceeding, augmented with metadata to assist in prototyping innovative search and discovery techniques. This data release contains structured metadata and the raw text of nearly 24 million comments submitted during the proceeding. The comment data were downloaded directly from the FCC’s Electronic Comment Filing System (ECFS) in January and February 2019, converted to a consistent format (machine-readable PDF or plain text), and augmented with information on which comments were cited in the FCC’s final order.
The release also includes query-term and document-term matrices to facilitate keyword searches on the comment corpus. An example of how these can be used with the BM25 ranking algorithm can be found here.
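As a rough illustration of the kind of keyword search these matrices support, the following Python sketch scores a handful of documents against a query with Okapi BM25. It is not the example linked above. The toy vocabulary, the counts, the function name `bm25_scores`, and the parameter defaults `k1 = 1.5` and `b = 0.75` are all assumptions made for illustration, and the actual matrices in the release are far larger and would likely call for a sparse representation.

```python
import math


def bm25_scores(doc_term, query_term, k1=1.5, b=0.75):
    """Score every document against every query with Okapi BM25.

    doc_term:   one row per document, holding raw term counts.
    query_term: one row per query, holding 0/1 term indicators.
    Both matrices must share the same vocabulary (column) order.
    Returns a list with one row of document scores per query.
    """
    n_docs = len(doc_term)
    n_terms = len(doc_term[0])
    doc_len = [sum(row) for row in doc_term]
    avg_len = sum(doc_len) / n_docs

    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in doc_term if row[t] > 0) for t in range(n_terms)]
    # Non-negative IDF variant (the "+1 inside the log" form).
    idf = [math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
           for t in range(n_terms)]

    scores = []
    for query in query_term:
        row_scores = []
        for d, counts in enumerate(doc_term):
            # Length normalization: long documents are penalized via b.
            norm = k1 * (1 - b + b * doc_len[d] / avg_len)
            score = 0.0
            for t in range(n_terms):
                if query[t] and counts[t]:
                    tf = counts[t]
                    score += idf[t] * tf * (k1 + 1) / (tf + norm)
            row_scores.append(score)
        scores.append(row_scores)
    return scores


# Hypothetical toy data, not drawn from the FCC corpus.
vocabulary = ["net", "neutrality", "title", "broadband"]
doc_term = [
    [2, 2, 0, 0],  # document 0 mentions "net" and "neutrality"
    [0, 0, 1, 3],  # document 1 mentions "title" and "broadband"
    [1, 0, 0, 1],  # document 2 mentions "net" and "broadband"
]
query_term = [
    [1, 1, 0, 0],  # query: "net neutrality"
]

scores = bm25_scores(doc_term, query_term)
# Document indices for the first query, best match first.
ranking = sorted(range(len(doc_term)), key=lambda d: scores[0][d], reverse=True)
print(ranking)
```

In this toy setup the document containing both query terms ranks first, the one containing only "net" ranks second, and the document sharing no terms with the query scores zero.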
You can find the dataset here.
Reference
Handan-Nader, Cassandra. 2022. Do fake online comments pose a threat to regulatory policymaking? Evidence from Internet regulation in the United States. Policy & Internet. https://doi.org/10.1002/poi3.327