The corpus as a network¶

Turning source documents into a graph with NLP

Moritz Mähr (University of Bern)

December 12, 2022

Lecture series "Einblicke in die Digital Humanities" (fall semester 2022)

Overview¶

  • The Evolution of Internet Governance
  • Turning source documents into a graph
    • Walkthrough of the Natural Language Processing (NLP)
    • Walkthrough of the Social Network Analysis (SNA)
  • Recap

The Evolution of Internet Governance¶

A digital history on the agency of technology standards.

  • Research question: Which actors play a role in Internet Governance and how has their role evolved over time?
  • Research data: a corpus of born-digital standards documents for Internet technologies such as e-mail, WWW, etc.
  • Quantitative research methods: Natural Language Processing (NLP) and Social Network Analysis (SNA)

The RFC Editor¶

  • The RFC series, published by the RFC Editor, is the oldest and most important publication series for Internet standards
  • Internet standards like IP, e-mail, WWW, etc. are published as RFCs (Requests for Comments)
  • Since 1969, more than 9,000 RFCs (over 225,000 pages) have been published
  • Although the process and institutional framework of standard-setting have changed, the format of the RFC has remained stable
  • All RFCs are publicly available at www.rfc-editor.org

An example: RFC 1945¶

The Hypertext Transfer Protocol (HTTP) is an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing of data representation, allowing systems to be built independently of the data being transferred.

Retrieving the source document¶

In [2]:
import requests

rfc1945 = requests.get("https://www.rfc-editor.org/rfc/rfc1945.txt").text

Looking at the source document¶

In [3]:
print(rfc1945_excerpt := rfc1945[6:917])
Network Working Group                                     T. Berners-Lee
Request for Comments: 1945                                       MIT/LCS
Category: Informational                                      R. Fielding
                                                               UC Irvine
                                                              H. Frystyk
                                                                 MIT/LCS
                                                                May 1996


                Hypertext Transfer Protocol -- HTTP/1.0

Status of This Memo

   This memo provides information for the Internet community.  This memo
   does not specify an Internet standard of any kind.  Distribution of
   this memo is unlimited.

IESG Note:

   The IESG has concerns about this protocol, and expects this document
   to be replaced relatively soon by a standards track document.

Some remarks about the source document¶

  • The source documents are highly structured, which makes the extraction of metadata "relatively easy".
  • Fortunately, the RFC Editor has already done that for us: www.rfc-editor.org/rfc/rfc1945.json
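
To illustrate why the fixed header layout makes metadata extraction comparatively easy, here is a minimal sketch that pulls the RFC number, category, and date out of the header with regular expressions. The sample text is copied from the RFC 1945 excerpt shown above. The helper name `parse_rfc_header` and the patterns are assumptions made for this illustration; they are not the RFC Editor's method, and older RFCs may well need different patterns.

```python
import re

# Header lines copied from the RFC 1945 excerpt shown above.
header = """\
Network Working Group                                     T. Berners-Lee
Request for Comments: 1945                                       MIT/LCS
Category: Informational                                      R. Fielding
                                                               UC Irvine
                                                              H. Frystyk
                                                                 MIT/LCS
                                                                May 1996
"""

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)


def parse_rfc_header(text):
    """Extract number, category and date from an RFC header (hypothetical helper)."""
    number = re.search(r"^Request for Comments:\s*(\d+)", text, re.MULTILINE)
    # The category ends where the run of single-spaced words ends.
    category = re.search(r"^Category:\s*(\S+(?: \S+)*)", text, re.MULTILINE)
    # The date is a month name and a four-digit year at the end of a line.
    date = re.search(rf"({MONTHS})\s+(\d{{4}})\s*$", text, re.MULTILINE)
    return {
        "number": int(number.group(1)) if number else None,
        "category": category.group(1) if category else None,
        "date": f"{date.group(1)} {date.group(2)}" if date else None,
    }


print(parse_rfc_header(header))
```

Authors and affiliations are harder to extract this way because they are right-aligned and interleaved, which is one reason to prefer the ready-made JSON metadata.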

Retrieving the metadata¶

In [4]:
import json

rfc1945_metadata = json.loads(
    requests.get("https://www.rfc-editor.org/rfc/rfc1945.json").content
)

Looking at the metadata¶

In [5]:
rfc1945_metadata
{
    'draft': '',
    'doc_id': 'RFC1945',
    'title': ' Hypertext Transfer Protocol -- HTTP/1.0 ',
    'authors': ['T. Berners-Lee', 'R. Fielding', 'H. Frystyk'],
    'format': ['ASCII', 'HTML'],
    'page_count': '60',
    'pub_status': 'INFORMATIONAL',
    'status': 'INFORMATIONAL',
    'source': 'HyperText Transfer Protocol',
    'abstract': ' The Hypertext Transfer Protocol (HTTP) is an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems.  This memo provides information for the Internet community.  This memo does not specify an Internet standard of any kind.  ',
    'pub_date': 'April 1996',
    'keywords': ['HTTP-1.0', 'HTTP', 'World-Wide', 'Web', 'application'],
    'obsoletes': [],
    'obsoleted_by': [],
    'updates': [],
    'updated_by': [],
    'see_also': [],
    'doi': '10.17487/RFC1945',
    'errata_url': None
}

More metadata¶

  • The Internet Engineering Task Force (IETF) provides much more metadata with a very nice interface at datatracker.ietf.org/doc/rfc1945/.
  • The IETF was founded in 1986 to organize all the working groups involved in standards development.
  • For example, at datatracker.ietf.org/doc/rfc1945/referencedby/ you can see which RFCs reference RFC 1945.

Screenshot of <https://datatracker.ietf.org/doc/rfc1945/referencedby/> taken on 9.12.22

Caveats¶

  • According to datatracker.ietf.org/, the metadata for RFCs with numbers smaller than about 1300 and drafts from before 2001 is unreliable and in many cases not available.
  • Standard-setting bodies (working groups, etc.), universities and research institutes, and companies are not properly referenced in the metadata.
  • There may be much more information in the documents that is not contained in the metadata.
  • Therefore, we use Natural Language Processing (NLP), more specifically Named Entity Recognition (NER), to extract more metadata.

Named Entity Recognition (NER) with a pre-trained model¶

In [7]:
import spacy
from spacy import displacy

language_model = "en_core_web_sm"

if not spacy.util.is_package(language_model):
    spacy.cli.download(language_model)

nlp = spacy.load(language_model)

Named Entity Recognition (NER) with a pre-trained model¶

  • This language model can extract the following entities by default:
Component Labels
ner CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
  • For named entities, it reaches a precision of 84.11%, a recall of 84.40%, and an F-score of 84.25%, which is decent but not outstanding. The best available models have an F-score of 94%.
  • For more information about the model, see github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.1.
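
As a quick sanity check on these figures: the F-score is the harmonic mean of precision and recall. The sketch below recomputes it from the two reported values; the function name `f_score` is chosen here for illustration only.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the model above, in percent.
print(round(f_score(84.11, 84.40), 2))  # 84.25
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F-score by trading away recall for precision or vice versa.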
In [8]:
doc = nlp(rfc1945_excerpt)
displacy.render(doc, style="ent")
Network Working Group T. Berners-Lee Request for Comments ORG : 1945 CARDINAL MIT ORG /LCS
Category: Informational R. Fielding
UC Irvine ORG
H. Frystyk
MIT/LCS
May 1996 DATE
Hypertext Transfer Protocol WORK_OF_ART -- HTTP/1.0
Status of This Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
IESG Note:
The IESG ORG has concerns about this protocol, and expects this document
to be replaced relatively soon by a standards track document.

Named Entity Recognition (NER) with a pre-trained model¶

  • Quite a few false positives (e.g., "Request for Comments" is tagged as ORG)
  • Low recall ("Network Working Group" and the first mention of "IESG" are not recognized as organizations, persons are not recognized at all, etc.)
  • Very broad categories (e.g. "ORG" for standardization bodies, universities, research institutes and companies)
  • Pre-trained models are not silver bullets and must be adapted to the corpus
  • This also applies to the more accurate (and more energy-consuming) transformer-based models.

Adding custom rules to the NER pipeline¶

In [9]:
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns(
    [
        {"label": "STD_BODY", "pattern": "Network Working Group"},
        {"label": "STD_BODY", "pattern": "IESG"},
    ]
)
In [10]:
doc = nlp(rfc1945_excerpt)
displacy.render(doc, style="ent")
Network Working Group STD_BODY T. Berners-Lee
Request for Comments: 1945 CARDINAL MIT ORG /LCS
Category: Informational R. Fielding
UC Irvine ORG
H. Frystyk
MIT/LCS
May 1996 DATE
Hypertext Transfer Protocol WORK_OF_ART -- HTTP/1.0
Status of This Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
IESG STD_BODY Note:
The IESG STD_BODY has concerns about this protocol, and expects this document
to be replaced relatively soon by a standards track document.

Adding more complex rules to the NER pipeline¶

In [12]:
more_complex_pattern
[
    {
        'label': 'STANDARD',
        'pattern': [
            {
                'LOWER': {
                    'IN': [
                        'bcp',
                        'fyi',
                        'ien',
                        'obsoleted',
                        'obsoletes',
                        'request',
                        'rfc',
                        'std',
                        'updated',
                        'updates'
                    ]
                }
            },
            {'LOWER': {'IN': ['for', 'by']}, 'OP': '?'},
            {'TEXT': 'Comments', 'OP': '?'},
            {'IS_PUNCT': True, 'OP': '?'},
            {'IS_DIGIT': True}
        ]
    }
]

Adding more complex rules to the NER pipeline¶

In [13]:
ruler.add_patterns(more_complex_pattern)
doc = nlp(rfc1945_excerpt)
displacy.render(doc, style="ent")
Network Working Group STD_BODY T. Berners-Lee
Request for Comments: 1945 STANDARD MIT ORG /LCS
Category: Informational R. Fielding
UC Irvine ORG
H. Frystyk
MIT/LCS
May 1996 DATE
Hypertext Transfer Protocol WORK_OF_ART -- HTTP/1.0
Status of This Memo
This memo provides information for the Internet community. This memo
does not specify an Internet standard of any kind. Distribution of
this memo is unlimited.
IESG STD_BODY Note:
The IESG STD_BODY has concerns about this protocol, and expects this document
to be replaced relatively soon by a standards track document.
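
To see what the token pattern matches without the statistical model interfering, it can be tried in a blank pipeline. The following is a minimal sketch under a few assumptions: it uses `spacy.blank("en")` (tokenizer only, no model download), a shortened keyword list, a made-up sample sentence, and the variable names `nlp_blank` and `ruler_blank` so that it does not overwrite the `nlp` object used above.

```python
import spacy

# A blank English pipeline has a tokenizer but no statistical components.
nlp_blank = spacy.blank("en")
ruler_blank = nlp_blank.add_pipe("entity_ruler")
ruler_blank.add_patterns(
    [
        {
            "label": "STANDARD",
            "pattern": [
                # Keyword that starts a reference (shortened list for this sketch).
                {"LOWER": {"IN": ["request", "rfc", "std"]}},
                # Optional filler tokens: "for"/"by", "Comments", punctuation.
                {"LOWER": {"IN": ["for", "by"]}, "OP": "?"},
                {"TEXT": "Comments", "OP": "?"},
                {"IS_PUNCT": True, "OP": "?"},
                # The reference must end with a number.
                {"IS_DIGIT": True},
            ],
        }
    ]
)

# Made-up sentence for illustration.
sample = "Request for Comments: 1945 was later followed by RFC 2068 and RFC 2616"
found = [(ent.text, ent.label_) for ent in nlp_blank(sample).ents]
print(found)
```

Each optional token (`"OP": "?"`) may be present or absent, which is why the same pattern covers both the long form "Request for Comments: 1945" and the short form "RFC 2068".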

Turning RFC 1945 into a graph¶

  • RFC 1945 can be represented as a graph with nodes and edges.
  • If we are interested in the social network of authors, we represent
    • standards and authors as nodes, and
    • the relationships between standards and their authors as edges.

Adding nodes to the graph¶

We add the standard RFC 1945 as well as all its authors T. Berners-Lee, R. Fielding, and H. Frystyk to the graph.

In [14]:
import networkx as nx

G_rfc1945 = nx.Graph()
G_rfc1945.add_node(rfc1945_metadata["doc_id"])
G_rfc1945.add_nodes_from(rfc1945_metadata["authors"])
print(G_rfc1945.nodes(data=True))
[('RFC1945', {}), ('T. Berners-Lee', {}), ('R. Fielding', {}), ('H. Frystyk', {})]

Adding nodes to the graph¶

In [16]:
graph.show()

Adding edges to the graph¶

For each author, we add an edge between the standard and that author.

In [17]:
G_rfc1945.add_edges_from(
    [(rfc1945_metadata["doc_id"], author) for author in rfc1945_metadata["authors"]]
)
print(G_rfc1945.edges(data=True))
[('RFC1945', 'T. Berners-Lee', {}), ('RFC1945', 'R. Fielding', {}), ('RFC1945', 'H. Frystyk', {})]

Adding edges to the graph¶

In [19]:
graph.show()

Properties of this graph¶

This graph has interesting properties:

  • There are two types of nodes: standards and authors.
  • Every edge connects an author with a standard.
  • There are no edges between nodes of the same type.
  • Thus, it is a bipartite graph and we can draw it as such.
In [20]:
from networkx.algorithms import bipartite

bipartite.is_bipartite(G_rfc1945)
True

Draw the graph as a bipartite graph¶

In [22]:
graph.show()

Adding more standards by Tim Berners-Lee to the graph¶

The IETF provides a database of all RFCs and their authors. We can use this to add all RFCs authored or co-authored by Tim Berners-Lee. See datatracker.ietf.org/person/timbl@w3.org

Screenshot of <https://datatracker.ietf.org/person/timbl@w3.org> taken on 9.12.22

Adding more standards by Tim Berners-Lee to the graph¶

In [23]:
rfcs_by_tbl_metadata = [
    json.loads(requests.get(f"https://www.rfc-editor.org/rfc/{rfc}.json").content)
    for rfc in [
        "rfc1630",
        "rfc1738",
        "rfc1866",
        "rfc2068",
        "rfc2396",
        "rfc2616",
        "rfc3986",
    ]
]
G_tbl = nx.Graph()
for rfc in rfcs_by_tbl_metadata:
    for author in rfc["authors"]:
        G_tbl.add_edge(rfc["doc_id"], author)

Adding more standards by Tim Berners-Lee to the graph¶

In [25]:
graph.show()

From authorship to co-authorship¶

  • Our bipartite graph tells us who wrote which RFC. It can tell us who wrote the most RFCs. This is something that we can do with a spreadsheet. We don't need a graph for that.
  • We can also turn the bipartite graph into a unipartite graph to find out who co-authored the most RFCs. This is something that we can't do with a spreadsheet. For this we need a relational data model, for example a graph.

Turning a bipartite graph into a unipartite graph¶

  • To turn our bipartite authorship graph into a unipartite co-authorship graph, we draw edges between all pairs of authors who co-authored an RFC. This is often referred to as projection.
  • RFC 1945 was authored by Tim Berners-Lee, Roy Fielding and Henrik Frystyk Nielsen. This means that we draw a graph with edges between Tim Berners-Lee and Roy Fielding, Tim Berners-Lee and Henrik Frystyk Nielsen, and Roy Fielding and Henrik Frystyk Nielsen.
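
The projection described above can be sketched with networkx's bipartite module. This is a minimal example under stated assumptions: the author lists of RFC 1945 and RFC 2068 are hard-coded rather than fetched, and the variable names (`authorship`, `B`, `coauthors`) are chosen for this illustration. `weighted_projected_graph` stores the number of shared RFCs as the edge attribute `weight`.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Author lists as printed on the two RFCs, hard-coded for illustration.
authorship = {
    "RFC1945": ["T. Berners-Lee", "R. Fielding", "H. Frystyk"],
    "RFC2068": ["R. Fielding", "J. Gettys", "J. Mogul", "H. Frystyk", "T. Berners-Lee"],
}

# Build the bipartite graph and mark the node type on every node.
B = nx.Graph()
for rfc, authors in authorship.items():
    B.add_node(rfc, bipartite=0)
    for author in authors:
        B.add_node(author, bipartite=1)
        B.add_edge(rfc, author)

# Project onto the author nodes: two authors are linked if they share an RFC.
author_nodes = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}
coauthors = bipartite.weighted_projected_graph(B, author_nodes)

print(coauthors["T. Berners-Lee"]["R. Fielding"]["weight"])  # 2
print(coauthors["T. Berners-Lee"]["J. Gettys"]["weight"])  # 1
```

The `weight` attribute is a natural candidate for the edge thickness when drawing the co-authorship graph.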

Co-authors of RFC 1945¶

In [27]:
graph.show()

All co-authors of Tim Berners-Lee¶

  • We repeat this process for all RFCs authored or co-authored by Tim Berners-Lee (i.e., for RFC 1630, RFC 1738, RFC 1866, RFC 2068, RFC 2396, RFC 2616, and RFC 3986).
  • Thus, we obtain a graph with all of Tim Berners-Lee's co-authors.

All co-authors of Tim Berners-Lee¶

In [29]:
graph.show()

All co-authors of Tim Berners-Lee¶

  • We can indicate the number of co-authored RFCs by the thickness of the edges.
In [31]:
graph.show()

All co-authors of Tim Berners-Lee¶

  • Tim Berners-Lee has collaborated most with Roy Fielding and Larry Masinter (four times each).
  • But are they equally important? It is hard to tell from this graph.
In [33]:
graph.show()

Analyzing the graph¶

A very basic indicator of the importance of a node is its degree (i.e., the number of edges attached to it). The degree centrality of an author is the share of all other authors in the network with whom that author has co-authored. It gives a first indication of how important an author is in the network. Because we selected only RFCs that were (co-)authored by Tim Berners-Lee (a so-called ego network), he has collaborated with every other author, so his own degree centrality is 1.

$$ \text{degree centrality}(\text{author}) = \frac{\text{degree}(\text{author})}{\text{total number of authors} - 1} $$
In [35]:
degree_centrality
Out[35]:
author degree centrality
0 T. Berners-Lee 1.000
1 L. Masinter 0.875
2 P. Leach 0.750
3 R. Fielding 0.750
4 J. Mogul 0.750
5 H. Frystyk 0.750
6 J. Gettys 0.750
7 M. McCahill 0.250
8 D. Connolly 0.125
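
The formula can be checked by hand on a small example. The sketch below builds a toy ego network with hypothetical authors (`ego`, `a`, `b`, `c`), applies the formula directly, and compares the result with `nx.degree_centrality`.

```python
import networkx as nx

# Toy ego network with hypothetical authors: "ego" co-authored with everyone,
# and "a" and "b" also co-authored with each other.
G_toy = nx.Graph([("ego", "a"), ("ego", "b"), ("ego", "c"), ("a", "b")])

# Degree centrality by hand: degree divided by the number of other nodes.
n = G_toy.number_of_nodes()
by_hand = {node: degree / (n - 1) for node, degree in G_toy.degree()}
from_networkx = nx.degree_centrality(G_toy)

for node in G_toy:
    print(node, round(by_hand[node], 3), round(from_networkx[node], 3))
```

Read backwards, the table above is consistent with the formula: with 9 authors in the network, a degree centrality of 0.875 corresponds to 7 co-authors out of 8 other authors.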

Analyzing a bigger network¶

We analyze all RFCs that were co-authored between 1994 and 1999. The resulting network is much bigger and more complex.

In [38]:
graph.show()
'Graph with 1141 nodes and 2201 edges'

A word about graph components¶

  • A component is a set of nodes that are all connected to each other, directly or through other nodes, and that are not connected to any node outside the set.
  • A graph can have multiple components.
  • Each component must be analyzed separately.
  • Our graph has 207 components.
  • Most of the components are very small (243 have fewer than 15 authors, 133 have only 1 author).
  • Over 57% of the authors are in the largest component.
  • We will focus on the largest connected component.
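
Finding the components and isolating the largest one can be sketched as follows. The toy graph and its author names are hypothetical; only the node counts in the last line are taken from the slides.

```python
import networkx as nx

# Toy co-authorship graph with hypothetical authors.
G_small = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")])
G_small.add_node("g")  # an author without any co-authors

# Connected components as sets of nodes, largest first.
components = sorted(nx.connected_components(G_small), key=len, reverse=True)
print([len(component) for component in components])  # [4, 2, 1]

# Keep only the largest component for further analysis.
giant = G_small.subgraph(components[0]).copy()
print(sorted(giant.nodes))  # ['a', 'b', 'c', 'd']

# Share of authors in the largest component, using the counts from the slides.
print(round(651 / 1141 * 100, 1))  # 57.1
```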

Analyzing the largest connected component¶

In [42]:
graph.show()
'Graph with 651 nodes and 1758 edges'

Analysis of the largest connected component¶

  • We have 651 authors in the largest component.
  • The degree distribution is skewed to the right. Most authors have a low degree. A few authors have a very high degree.
  • This means: most authors are not very well connected.
  • We can assume that there are several communities (i.e., groups of authors that are well connected).
  • How can we find these communities?
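
A degree distribution is simply a count of how many nodes have each degree. The sketch below computes one for a small, hypothetical edge list in which a single "hub" author has many co-authors and most others have few, which is the right-skewed shape described above.

```python
from collections import Counter

# Hypothetical co-authorship edges: one hub and several peripheral authors.
edges = [
    ("hub", "a"),
    ("hub", "b"),
    ("hub", "c"),
    ("hub", "d"),
    ("hub", "e"),
    ("a", "b"),
]

# Degree of each author: number of edges that touch the author.
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

# Degree distribution: how many authors have degree 1, 2, ...
distribution = Counter(degree.values())
print(sorted(distribution.items()))  # [(1, 3), (2, 2), (5, 1)]
```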

Analyzing the community structure¶

We use a community detection algorithm to identify communities in the graph.

In [46]:
G_94_99_uni_giant_communities
Out[46]:
community author count
0 0 96
1 1 86
2 2 60
3 3 59
4 4 46
5 5 36
6 6 34
7 7 32
8 8 30
9 9 27
10 10 26
11 11 25
12 12 24
13 13 20
14 14 19
15 15 13
16 16 9
17 17 5
18 18 4
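
The slides do not say which community detection algorithm produced this table. As one possible approach, the following sketch uses networkx's greedy modularity maximization on a toy graph with hypothetical authors: two tightly knit groups of four that are linked by a single co-authorship.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two hypothetical groups in which everybody has co-authored with everybody.
left = ["a0", "a1", "a2", "a3"]
right = ["b0", "b1", "b2", "b3"]

G_groups = nx.Graph()
G_groups.add_edges_from(combinations(left, 2))
G_groups.add_edges_from(combinations(right, 2))
G_groups.add_edge("a0", "b0")  # the only link between the two groups

# Greedy modularity maximization (an assumption, not necessarily the
# algorithm used for the table above).
communities = greedy_modularity_communities(G_groups)
for index, members in enumerate(communities):
    print(index, sorted(members))
```

Different algorithms (and sometimes different runs of the same algorithm) can assign authors to different communities, so the partition should be read as one plausible grouping rather than a fixed property of the network.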

Analyzing the community structure¶

  • How do we find the important members of the communities?
  • We calculate centrality measures for all co-authors in the graph.
In [48]:
G_94_99_uni_giant_author_measures.round(3)
Out[48]:
author degree centrality betweenness centrality eigenvector centrality pagerank community
0 S. Deering 0.051 0.121 0.282 0.007 0
1 Y. Rekhter 0.051 0.142 0.125 0.008 1
2 L. Zhang 0.045 0.052 0.259 0.005 0
3 F. Baker 0.043 0.134 0.090 0.006 0
4 V. Jacobson 0.042 0.076 0.261 0.005 0
... ... ... ... ... ... ...
646 C. Burton 0.002 0.000 0.000 0.000 4
647 M. Beadles 0.002 0.000 0.000 0.000 15
648 D. Perkins 0.002 0.000 0.000 0.001 16
649 T. Henderson 0.002 0.000 0.014 0.000 0
650 D. Stenerson 0.002 0.000 0.000 0.000 3

651 rows × 6 columns

A word about centrality measures¶

  • Degree centrality: For finding very connected individuals, popular individuals, or individuals who can quickly connect with a wider network.
  • Betweenness centrality: For finding individuals who are likely to be "bottlenecks" in the network, i.e., individuals who lie on many of the shortest paths between others and are therefore most important in connecting the network.
  • Eigenvector centrality and PageRank: For finding individuals who are likely to be authorities in the network, i.e., individuals who are connected to other well-connected individuals.
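
That these measures capture different things can be seen on a classic textbook example, the Krackhardt kite graph that ships with networkx. It is used here only as a stand-in for the co-authorship network: the node with the most connections is not the node that holds the network together.

```python
import networkx as nx

# Classic 10-node example graph (nodes are numbered 0 to 9).
G_kite = nx.krackhardt_kite_graph()

degree = nx.degree_centrality(G_kite)
betweenness = nx.betweenness_centrality(G_kite)
eigenvector = nx.eigenvector_centrality(G_kite)

# Node 3 has the most neighbors ...
print(max(degree, key=degree.get))  # 3
# ... but node 7 is the bottleneck between the dense core and the tail.
print(max(betweenness, key=betweenness.get))  # 7

# Eigenvector centrality rewards being connected to well-connected nodes.
for node in sorted(eigenvector, key=eigenvector.get, reverse=True)[:3]:
    print(node, round(eigenvector[node], 3))
```

PageRank can be computed in the same way with `nx.pagerank`.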

Analyzing the community structure¶

We identify the "leader" of each community by selecting, among its members, the author with the highest eigenvector centrality.

In [50]:
G_94_99_uni_giant_communities_leaders.round(3)
Out[50]:
author count author eigenvector centrality
community
0 96 S. Deering 0.282
1 86 Y. Rekhter 0.125
2 60 J. Postel 0.065
3 59 M. Kosters 0.007
4 46 B. Carpenter 0.029
5 36 J. Mogul 0.021
6 34 W. Simpson 0.000
7 32 M. Allman 0.036
8 30 R. Hinden 0.069
9 27 H. Schulzrinne 0.031
10 26 H. Alvestrand 0.012
11 25 G. Parsons 0.004
12 24 P. Hoffman 0.001
13 20 C. Perkins 0.010
14 19 A. Smith 0.001
15 13 B. Aboba 0.000
16 9 P. Vixie 0.001
17 5 A. Orda 0.003
18 4 R. Coltun 0.006
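
Selecting one leader per community amounts to a grouped maximum. The sketch below does this in plain Python for the first five rows of the table of author measures shown earlier; the variable names are chosen for illustration.

```python
# (author, eigenvector centrality, community), copied from the first five
# rows of the table of author measures.
rows = [
    ("S. Deering", 0.282, 0),
    ("Y. Rekhter", 0.125, 1),
    ("L. Zhang", 0.259, 0),
    ("F. Baker", 0.090, 0),
    ("V. Jacobson", 0.261, 0),
]

# Keep, for every community, the author with the highest centrality so far.
leaders = {}
for author, centrality, community in rows:
    best = leaders.get(community)
    if best is None or centrality > best[1]:
        leaders[community] = (author, centrality)

print(leaders)  # {0: ('S. Deering', 0.282), 1: ('Y. Rekhter', 0.125)}
```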

Analyzing the community structure¶

  • Can we say something about the relationship between the communities?
  • If we assume that two communities are closer to each other if their leaders are connected, we can calculate the distance between the leaders.
  • A distance of 1 means that the leaders are co-authors. A distance of 2 means that the leaders are co-authors of a co-author. A distance of 3 means that the leaders are co-authors of a co-author of a co-author. And so on.
  • To calculate the distance between two leaders, we compute the length of the shortest path between them in the co-authorship graph.
In [52]:
most_important_authors_shortest_path_matrix.astype(int)
Out[52]:
S. Deering Y. Rekhter J. Postel M. Kosters B. Carpenter J. Mogul W. Simpson M. Allman R. Hinden H. Schulzrinne H. Alvestrand G. Parsons P. Hoffman C. Perkins A. Smith B. Aboba P. Vixie A. Orda R. Coltun
S. Deering 0 1 1 2 2 1 3 2 1 2 2 3 3 2 4 6 3 3 3
Y. Rekhter 1 0 1 2 1 2 3 3 1 3 3 3 3 3 4 6 2 4 3
J. Postel 1 1 0 1 2 2 4 3 1 3 3 3 3 3 4 7 3 4 3
M. Kosters 2 2 1 0 3 3 5 4 2 4 4 4 4 4 5 8 3 5 4
B. Carpenter 2 1 2 3 0 2 4 3 1 3 2 4 2 4 4 7 3 4 4
J. Mogul 1 2 2 3 2 0 4 3 2 3 2 4 2 3 5 7 4 4 4
W. Simpson 3 3 4 5 4 4 0 4 3 5 2 4 4 5 5 3 5 5 4
M. Allman 2 3 3 4 3 3 4 0 3 3 3 3 5 4 4 7 5 3 4
R. Hinden 1 1 1 2 1 2 3 3 0 3 2 3 2 3 4 6 3 4 3
H. Schulzrinne 2 3 3 4 3 3 5 3 3 0 4 4 5 2 5 8 5 4 4
H. Alvestrand 2 3 3 4 2 2 2 3 2 4 0 2 2 4 3 5 4 5 3
G. Parsons 3 3 3 4 4 4 4 3 3 4 2 0 4 5 4 7 5 5 3
P. Hoffman 3 3 3 4 2 2 4 5 2 5 2 4 0 5 5 7 5 6 5
C. Perkins 2 3 3 4 4 3 5 4 3 2 4 5 5 0 5 8 5 5 4
A. Smith 4 4 4 5 4 5 5 4 4 5 3 4 5 5 0 8 6 4 3
B. Aboba 6 6 7 8 7 7 3 7 6 8 5 7 7 8 8 0 8 8 7
P. Vixie 3 2 3 3 3 4 5 5 3 5 4 5 5 5 6 8 0 6 5
A. Orda 3 4 4 5 4 4 5 3 4 4 5 5 6 5 4 8 6 0 4
R. Coltun 3 3 3 4 4 4 4 4 3 4 3 3 5 4 3 7 5 4 0
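
In an unweighted graph, the distance between two authors can be found with a breadth-first search. The following is a minimal sketch of that idea in plain Python; the helper `coauthor_distance` and the toy graph with authors `a` to `e` are hypothetical.

```python
from collections import deque


def coauthor_distance(graph, start, goal):
    """Number of co-authorship steps from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return distance + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


# Hypothetical co-authorship chain a - b - c - d, plus an isolated author e.
toy_graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": set(),
}

print(coauthor_distance(toy_graph, "a", "b"))  # 1: co-authors
print(coauthor_distance(toy_graph, "a", "c"))  # 2: co-authors of a co-author
print(coauthor_distance(toy_graph, "a", "d"))  # 3
print(coauthor_distance(toy_graph, "a", "e"))  # None: not connected
```

Within a single connected component, such as the one analyzed here, every pair of authors has a finite distance, which is why the matrix above contains no missing values.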

The distance between two communities and their "leaders"¶

  • S. Deering, the "leader" of community 0, is close to most of the other leaders: only two of them are more than three steps away.
  • B. Aboba, the "leader" of community 15, is very isolated: even the nearest other leader is three steps away.
In [54]:
most_important_authors_shortest_path_matrix_color.show()
community 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
community                                      
0 0 1 1 2 2 1 3 2 1 2 2 3 3 2 4 6 3 3 3
1 1 0 1 2 1 2 3 3 1 3 3 3 3 3 4 6 2 4 3
2 1 1 0 1 2 2 4 3 1 3 3 3 3 3 4 7 3 4 3
3 2 2 1 0 3 3 5 4 2 4 4 4 4 4 5 8 3 5 4
4 2 1 2 3 0 2 4 3 1 3 2 4 2 4 4 7 3 4 4
5 1 2 2 3 2 0 4 3 2 3 2 4 2 3 5 7 4 4 4
6 3 3 4 5 4 4 0 4 3 5 2 4 4 5 5 3 5 5 4
7 2 3 3 4 3 3 4 0 3 3 3 3 5 4 4 7 5 3 4
8 1 1 1 2 1 2 3 3 0 3 2 3 2 3 4 6 3 4 3
9 2 3 3 4 3 3 5 3 3 0 4 4 5 2 5 8 5 4 4
10 2 3 3 4 2 2 2 3 2 4 0 2 2 4 3 5 4 5 3
11 3 3 3 4 4 4 4 3 3 4 2 0 4 5 4 7 5 5 3
12 3 3 3 4 2 2 4 5 2 5 2 4 0 5 5 7 5 6 5
13 2 3 3 4 4 3 5 4 3 2 4 5 5 0 5 8 5 5 4
14 4 4 4 5 4 5 5 4 4 5 3 4 5 5 0 8 6 4 3
15 6 6 7 8 7 7 3 7 6 8 5 7 7 8 8 0 8 8 7
16 3 2 3 3 3 4 5 5 3 5 4 5 5 5 6 8 0 6 5
17 3 4 4 5 4 4 5 3 4 4 5 5 6 5 4 8 6 0 4
18 3 3 3 4 4 4 4 4 3 4 3 3 5 4 3 7 5 4 0

Recap¶

Why do I want to turn the corpus into a network?

  • Visualize, explore, and understand the corpus.
  • Identify important documents, actors, and other entities.
  • Recognize relationships among documents, actors, and other entities
  • Recognize clusters of documents, actors, and other entities
  • Recognize evolving patterns in the corpus
  • Form and test hypotheses about the corpus
  • Develop research questions

The one book everyone should read¶

  • Jackson, Matthew O. The Human Network: How Your Social Position Determines Your Power, Beliefs, and Behaviors. Knopf Doubleday Publishing Group, 2019.

Thank you for your attention!¶

For questions, comments, and feedback, please contact me at moritz.maehr@unibe.ch