data praxis: a could be common cause

Thomas Padilla
May 22, 2018


* delivered at Data Intersections: University of Miami, March 2 2018 *

Thank you to Paige, Alberto, and Cameron for a wonderful event. I missed DH+DJ the last time around and really regretted it. I feel honored to participate in a conversation that brings the work of librarians, data scientists, journalists, and statisticians together. For my part I’ll be talking about something I call “data praxis” and I’ll work toward framing it as a “could be common cause” that joins our work.

I do want to flag that I will be discussing some difficult content during my talk, made especially challenging given the recent shooting in Parkland. To be crystal clear there will be discussion about two mass shootings. If you feel the need to leave I understand. I will flag this content once more immediately prior to discussion.

I’ll move across 4 topics during this talk.

  1. First, I’ll stage the concept of data praxis and I’ll elaborate on a “could be common cause”.
  2. Second, I’ll introduce the need for data praxis in the “here and now”.
  3. Third, I’ll introduce the concept of collections as data in libraries and I’ll suggest that it is a prime vector for collaboration between us.
  4. Finally, I’ll close with some concrete suggestions that pave the road to our collective ability to cultivate a broad data praxis.

When I talk about agency and self actualization I am referring to a broad cross section of society having the skills and dispositions to critically engage with the data driving the stories that we tell. In times like these we cannot afford the risk of this work being relegated beneath the banner of academic fetish or professional spellcasting.

In my day to day job as a librarian I seek to cultivate the ability to explore the world — what it has to offer AND also what it might try to take away.

This is a responsibility that plays itself out across disciplines and at multiple levels of scholarly and professional inquiry.

For the past couple of years a fair amount of my time has been spent trying to help people to see data in the world around them.

To come to an understanding that apparent surrogates of physical experiences rendered in digital spaces are much more than they appear to be.

Much of this conversation hinges on taking commonly understood things like books and casting them as data.

These conceptual moves are supported by discussion of recent work by Underwood, Bamman, and Lee that evaluates the characterization of gender across more than 170 years of English-language fiction.

Similarly I cast the seemingly humble encyclopedia, Wikipedia, as a highly mutable site of contest.

I illustrate that contest through discussion of experimental work that leverages Wikipedia as data. For example, Ed Summers’ work with Congressional Edits — a Twitter bot that monitors edits to Wikipedia associated with the IP ranges of Congress.
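The heart of such a bot is a simple question: does the IP attached to an anonymous edit fall inside a block allocated to Congress? A minimal sketch of that check, using Python's standard library (the ranges shown are commonly cited published allocations for the House and Senate — treat them as illustrative, not as Summers' actual code or configuration):

```python
import ipaddress

# Commonly cited published allocations; verify against current
# registry data before relying on them.
CONGRESS_RANGES = [
    ipaddress.ip_network("143.231.0.0/16"),  # U.S. House of Representatives
    ipaddress.ip_network("156.33.0.0/16"),   # U.S. Senate
]

def is_congress_edit(ip: str) -> bool:
    """Return True if an anonymous edit's IP falls in a monitored range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CONGRESS_RANGES)
```

In practice the bot would apply this check to the IPs surfaced in Wikipedia's public stream of recent changes and tweet any matches.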

It’s a real-time report on a battle for meaning and is suggestive of a broader field of contest humming below the surface of seemingly mundane interactions on the web.

You could say that my work is largely focused on the cultivation of possibility.

I’ll turn now to discussion of how I see this work given the challenges of the “here & now”.

Cultivation of possibility as a job seems pretty awesome, and it is. I enjoy it.

However, since that development on the right, the types of orientation to data that I’ve been trying to cultivate for the past few years have taken on a new urgency.

Collectively, common understanding is under assault.

And that assault is being waged in such a way that countering it cannot come without a commonly held understanding of a theory and practice of data.

Without that common understanding, at an individual and a societal level, we are lost.

And as we are lost, it follows that we lose the ability to identify and build upon the worlds that we would like to achieve.

We cede possibility rather than participate in its continual realization.

In September of last year, I was working as a Humanities Data Curator at the University of California Santa Barbara, at the edge of the Pacific ocean.

Toward the end of September I made my way with my partner to settle at the bottom of a former sea in the Mojave desert.

More specifically, I moved to Las Vegas.

As promised, I want to flag that I am going to discuss sensitive content.

Specifically I am going to talk about two mass shootings. I’ll hold for a moment.

My first day of work at the University of Nevada, Las Vegas was the day after the October 1 shooting, in which 58 people were killed, and more than 800 were injured by a lone white male using multiple AR-15s. Motive unknown.

Fairly soon after the event UNLV Libraries along with cultural heritage organizations in the area began a concerted effort to document the event for public memory and future research.

When an event like this happens, what people don’t often think of is the weight of evidence that is generated.

It is a physical weight. And an emotional one too — for the archivists and librarians charged with holding these materials — an acquisition that often entails years of daily interaction with trauma.

A librarian colleague, Ashley Maynor, now at NYU produced a documentary called, “Story of the Stuff”. Story of the Stuff tells the story of the physical legacy of the Sandy Hook shooting. Maynor’s work on the Sandy Hook documentary was sadly motivated in part by her desire to process her own experience at Virginia Tech during the mass shooting there a few years prior.

Of course in addition to the physical legacy of these tragedies, increasingly we need to deal with the data they leave in their wake.

At UNLV I was responsible for collecting Twitter data associated with the October 1 shooting.

There are many millions of tweets in this dataset — a number of staccato statements that stretches the imagination.

As I processed them the most tangible takeaway I had was the feeling of the emotions attached to each utterance.

In light of that experience I created the following visualization of the full run of emojis in this dataset.

More than 800,000 so far. 23 minutes and 9 seconds to run the course.
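Pulling the emojis out of the tweet text is conceptually simple. A toy sketch of how it might be done (my illustration here, not the pipeline behind the visualization; the character ranges cover only the major emoji blocks, and a production workflow would lean on a maintained library since the emoji standard grows every year):

```python
import re

# Character ranges for the major emoji blocks only; not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols
    "\u2600-\u27BF"          # misc symbols & dingbats
    "]"
)

def extract_emojis(text: str) -> list[str]:
    """Pull every emoji out of a tweet, in order of appearance."""
    return EMOJI_PATTERN.findall(text)
```

Run across millions of tweets, a function like this yields the ordered stream of emojis that a visualization can then animate.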

I could drown in this data.

It’s difficult to imagine the experience of the person whose daily job isn’t data.

And yet, this is our cultural record.

I am part of a concerted professional response in the archival profession to develop best practices for documenting events of this kind.

I will say, it’s a sad day when you wake up to your first national tragedy task force meeting.

I’ll leave it to your imagination the reasons and experiences that each member of the task force brings to the work.

The effort of the task force is primarily geared toward developing best practices for documenting these tragedies.

It is less focused on developing the means to work with and derive meaning from the data that these events leave behind.

For me, this is a gap that I would like to see filled with a collaboration between digital humanists, data scientists, statisticians, and data journalists.

By combining our effort I think we can make significant contributions to a broadly realized form of data praxis — something that is integral to helping people process these events and the stories that are told in their aftermath.

To start, I would like us to collectively work on these questions together.

How exactly will our collaboration play out? The answer is uncertain.

It’s uncertain in the sense of aligning effort, but it’s also uncertain in the sense that I’m not sure any of us have figured out how to naturalize the theory and practice of working with data more broadly.

Not having the exact answer should not be a barrier to us getting started.

Braess’s paradox aligns with Floridi’s comment.

Time and again the paradox holds that creating the most direct path to solve a given problem can actually produce a less efficient system.

The paradox suggests that a multiplicity of solutions grounded by a healthy degree of uncertainty may actually be ideal.

For me the paradox suggests a great deal of possibility.

I’ll turn now to introduce the concept of collections as data and the work that cultural heritage organizations like libraries, archives, and museums have been dedicating to it.

I believe that it is a primary vector for collaboration between us all.

When I talk about collections as data, I am referring to all of the collections a cultural heritage institution holds — digitized and born digital.

This is inclusive of books, artwork, and sound recordings.

It is also inclusive of the less commonly used products of our contemporary knowledge environment — web archives, social media data, 3D renderings, software, and code.

Collections as data are ordered information, stored digitally, and given those two conditions they are inherently amenable to computation.

The concept seeks to foster an orientation to collections that makes it possible to work with them computationally, using methods and approaches including but not limited to text mining, data mining, network analysis, and machine learning.
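At its simplest, text mining can start with nothing more than counting terms across a digitized collection. A toy sketch of my own, just to make the idea concrete:

```python
import re
from collections import Counter

def top_terms(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Most frequent word tokens in a document, lowercased."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(n)
```

Even a counting exercise like this only works when the collection is available in a form a script can read — which is exactly the orientation collections as data seeks to foster.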

The work is guided by sentiments like those expressed by Martin Mueller, “every surrogate has its own query potential which for some purposes may exceed that of the original.”

I do a lot of advocacy work in this space along with Laurie Allen, Stewart Varner, Sarah Potvin, Hannah Frost, and Elizabeth Russey Roke via an IMLS supported project called, “Always Already Computational: Collections as Data”.

We seek to develop the means for a broad range of cultural heritage organizations to develop and provide access to collections as data.

As an example of that work, I’d like to refer to MIT Libraries’ experimentation with providing a text and data mining service for its theses and dissertations.

Why an API for theses and dissertations?

First, the library was witnessing attempts to scrape collections whose infrastructure was not designed to support that level of access.

Those scraping attempts were largely driven by the perception of potentially significant intellectual property within the vast array of research generated over the years.

Second, senior administrators across disciplines on campus requested that the library provide portal-like access to better facilitate computational use of the collections.

And thus an experiment in data processing and API development was born. I’d encourage you to take a look, and I’m sure that folks there would be happy to hear your feedback.

Of course not all of this work is so resource intensive.

Nor is it so heavily oriented to some of the less accessible forms of computational work.

Your colleagues here at the University of Miami have also charted an initial experiment with collections as data via generation of a Spanish language dataset derived from a historic newspaper held at the university.

I imagine that this could be prime material to support introductory experimentation with computational methods in the undergraduate classroom.

In the Always Already Computational project, we realize that this work constitutes a technical AND social challenge.

It’s not simply a matter of sorting out API specifications.

Rather, it requires critical conversations with a range of stakeholders to identify challenges and opportunities that align with specific community needs.

We’ve dedicated a lot of effort to the social and technical dimensions of this challenge through conversations with representatives from a diverse range of institutions, scholarly, and professional societies.

NICAR is right around the corner and we are really excited about the opportunity to learn more from journalists.

Going back in time a bit — to 2015 — one of the primary inspirations for Always Already Computational was some work that Jeremy Singer-Vine and colleagues at Buzzfeed News were doing with reproducible data journalism.

In the academic space, reproducibility and data often go together, so that wasn’t particularly novel.

What sparked my interest was whether this kind of journalism anticipated data fluency demand across a broadly conceived public.

It had me thinking:

Will the reader, akin to the academic researcher, come to expect access to data and code driving a journalistic output?

Will they want to run scripts in an IPython notebook to test core claims in a piece?

And finally I thought:

Who is this reader?

Who would want such a thing?

Who would be responsible for helping them — conceptually and practically to identify that they might want such a thing — to help them to make use of it?

Shifting back to the university community, we find corollaries with Buzzfeed’s work in the context of the Humanities classroom.

I’ll talk briefly about Tiffany Chan’s “The Author Function”.

Chan, an MA student in English, leverages machine learning to experiment with imitating the style of Grant Allen.
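Chan’s actual approach used a neural network. As a rough illustration of the underlying idea — learning “style” from character sequences and then generating new text from what was learned — a much simpler character-level Markov chain can stand in. This sketch is mine, not Chan’s code:

```python
import random
from collections import defaultdict

def build_model(text: str, order: int = 3) -> dict:
    """Map each character n-gram to the characters observed after it."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def imitate(text: str, length: int = 80, order: int = 3, seed: int = 0) -> str:
    """Generate text in the statistical 'style' of the source."""
    random.seed(seed)
    model = build_model(text, order)
    state = text[:order]
    out = state
    for _ in range(length - order):
        choices = model.get(state)
        if not choices:  # dead end: n-gram never continued in the source
            break
        out += random.choice(choices)
        state = out[-order:]
    return out
```

A neural network generalizes far better than this lookup table, but the workflow — train on an author’s corpus, then sample new text — is the same in kind.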

The research is interesting in its own right, but for our purposes I am more interested in focusing on the process.

Looking over her conventions for documenting process and interpretation, we see marked similarities with Singer-Vine’s work at Buzzfeed.

Chan provides access to the code — she documents in detail the functions of her scripts and both the promise and the pitfall of neural networks.

Chan closes with detailed discussion of the primary data underlying her analysis.

She used a cultural heritage collection as data to drive this work forward.

What I see in the Singer-Vine + Chan comparison is the potential for libraries to support the sandboxing of a theory and practice for working with data that broadly prepares diverse members of our society … to engage with mass communication that centers data and process that supports claims.

I think we could all probably agree that evidence and the means to interpret it are sorely lacking in the current climate.

The consequences of this lack are staggering in scope, with the potential to touch us all.

By combining forces we can work in common cause against this.

Finally, I’ll turn toward a discussion of the kinds of partnerships that I would like to see.

I would like for us to partner more on joint experimentation to evidence what might be possible with data.

The Library of Congress, for example, has hosted a researcher in residence as well as an innovator in residence recently.

Experimentation produces neat outputs but the lasting effect goes beyond the initial shine.

In my view experimental work geared toward public audiences serves to promote interest in the theory and practice of working with data.

Jer Thorp produced the following experiment which draws on data about Library of Congress books.

By knowing how to see potential in data, Jer was able to create a new form of interaction with the collection.

This is an example that sparks the imagination — it causes us to wonder what else might be possible and in so doing it holds the potential to motivate a path toward learning more about a theory and practice of working with data.

We desperately need to partner on cultivation of data ethics.

Let’s bring folks like Bergis Jules (University Archivist at UC Riverside) and Yvonne Ng (Witness) together with data scientists, journalists, and statisticians.

Let’s foster partnerships that center the needs of those communities that are so often studied and written about but rarely if ever given a voice.

Also, I shouldn’t need to say this, but these people need to be compensated for this work.

Ethics is expertise.

Pay them.

Full stop.

Bergis Jules works on a powerful project called Documenting the Now, a social media archiving effort that rose amid the Ferguson protests following the death of Michael Brown.

Tools developed by the project have subsequently been used to document the Bataclan terrorist attack in Paris, the Women’s March, Black Lives Matter, and the October 1 shooting in Las Vegas.

Beyond merely capturing the data, folks like Bergis try to sort out an ethical approach to data that Institutional Review Boards generally offer little assistance with.

I’d like to see more partnerships that focus on documenting and meeting community needs.

MIT provides a leading example.

The MIT Public Library Innovation Exchange — a media lab initiative — aims to partner with public librarians to develop new creative learning programs together.

Huge shoutout to the Knight Foundation for supporting that.

Finally, I’ve argued for this in another context, and I’m not really sure how popular it is, but I’d love to see more funders support cross sector staff exchange in the United States. This is something supported in the EU through a Marie Sklodowska-Curie action. It allows for temporary staff exchanges between academic and private sector entities.

Simply put, we need more knowledge of each other.

Speaking for the academic side, we are often very much on our heels trying to make sense of data trafficked on private sector platforms like Twitter and Facebook. We struggle to develop tools, stay abreast of API and data model changes, and move in an indeterminate space when it comes to policy.

How can we help our users in this space without more knowledge?

We need to know more about these cornerstone social platforms that capture and convey so much of the contemporary cultural record.

I said data a lot during this talk. I mentioned things like text mining, data mining, and network analysis.

Despite my goals to be clear I’ve probably veered in and out of pretty academic language.

What might we call this conversation that we’ve been having?

To some degree I think that’s an academic question that misses the point.

We must remain vigorously focused on the challenges of the present.

Data is all around us, and it can’t remain the case that we are the only ones who understand it.

With that said, I really don’t care if it’s called data praxis or not, so long as whatever we decide to call it collects our effort in a way that increases agency.

Agency for who? Agency for all.

Thank you.
