Tutorial overview

In this tutorial we will learn to:

  1. Search Open Context (OC) for a specific data set
  2. Explore the data available on OC to locate a particular kind of material
  3. Download one context sheet, and then a few more
  4. Open the file to understand what it contains
  5. Read the structure of the material to evaluate how the data in it might be transposed
  6. Consider the data types within the document
  7. Create a table based on the data in the sheet
  8. Fill in the table based on the data in the sheet
  9. Select a different data sheet and fill it in based on the organizational / table schema we created
  10. See what patterns emerge

Key concepts used in this tutorial

data structure

note taking

paperwork or documentation

context sheets

columns, attributes, headings

entries, rows, observations

data types

string

numeric

integer

Introduction

Welcome (back) to our digital data stories! In this tutorial, we’re going to look at something more disorganized than some of our other tutorials. Rather than download a ready-to-go data set, we’re going to build one ourselves. Building data sets can be a complicated task, however, so we’ll walk through all the steps together before we try to do it on our own.

We might be wondering, why would we want to do this? With so many archaeological data sets out there, why would we bother creating our own? Well, as much as we like the well-curated collection of data sets that Open Context has, data sets like that are not that common.

Much more likely is that whenever we start a project, we’ll need to decide everything about that project. We’ll need to decide how we collect our data, how we observe things, as well as a bunch of other logistics. Since we can’t run a whole archaeological excavation through tutorials (at least not yet), we can skip some of those parts to focus on an undervalued part of archaeology: paperwork.

Picking a project

We’re going to leave aside some of the tougher decisions as far as data collection is concerned and skip ahead to the data collation part. With that, we can go from Notes (aka paperwork) to Data, and then from that Data to a narrative about a site. To do this we’ll be using context sheets from the site of Gabii, located in Italy.

Before we can do this though, we’ll need to search OC for this particular project. We’ll do this by going to the website first (www.opencontext.org) and putting Gabii into the search bar at the top of the page.

This is where we can find the search bar on the OC website

Once we’ve put Gabii into the search bar and clicked search, we should be taken to a page that looks a bit like this:

Here’s what it should look like once we’ve searched for Gabii

Here we are going to scroll down to a section that is called Filtering Options and select a few of these options.

This is what the filtering options look like

Specifically, we’ll want to open the Project option and select or click The Gabii project. This will take us to another page that has limited all the entries just to The Gabii project, which was what we were looking for.

On that page, we’ll scroll down to those filtering choices again and be even more specific about our needs. We’ll want to open the Descriptions (Project Defined) option. This will show us the different ways that the project defined some of the descriptions associated with it. Under that, we’ll want to select or click File Attachment Type to filter the data. This will, once again, bring us to a new page.

The filtering option will look like this when we click or select 'File Attachment Type'

There, we’ll scroll down and look for the File Attachment Type option. While it might seem tedious to click, scroll, click, and scroll, these different layers introduce the many different ways that data in OC are associated, which is part of the power of the tool.

The filtering option for Context Sheet will look like this before we click or select it

Under that filtering option, we’ll select the specific kind of data that we want, the Context Sheet. We’ll click or select this to get to our final selection. This will take us to the page with just the context sheets for the Gabii project. This time, when we scroll down to the bottom of the page, we’ll get a list of all the context sheets for the project.

This is what the list of context sheets will look like

Also, if we want to skip ahead, or we accidentally mess up the paths or go down an OC rabbit hole, we can go straight to the context sheets from this URL: https://opencontext.org/search/?proj=104-the-gabii-project&prop=104-file-attachment-type---104-context-sheet&prop=oc-gen-image&q=Gabii&rows=20&type=media#18/41.88878/12.71939/20/any/Google-Satellite.
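If we’re curious how that long URL relates to the clicks we just made, a small Python sketch can pull it apart. It uses only the standard library and the URL given above. The function and variable names are ours, and reading each parameter as one filtering step is our interpretation, not something documented by Open Context.

```python
from urllib.parse import urlsplit, parse_qs

# The shortcut URL from the tutorial (sentence-ending period left off).
URL = (
    "https://opencontext.org/search/?proj=104-the-gabii-project"
    "&prop=104-file-attachment-type---104-context-sheet"
    "&prop=oc-gen-image&q=Gabii&rows=20&type=media"
    "#18/41.88878/12.71939/20/any/Google-Satellite"
)


def search_filters(url):
    """Return the query parameters of a search URL as a dict of lists."""
    return parse_qs(urlsplit(url).query)


filters = search_filters(URL)
for name, values in filters.items():
    # A parameter that appears twice (like prop) keeps both values.
    print(name, "->", values)
```

The project, the File Attachment Type choice, and our search term each show up as their own parameter, which seems to mirror the click, scroll, click path we followed.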

However, now that we’re here we’re going to want to click on one of the hyperlinked Item Label numbers. This will take us to the page for just that context sheet and have a button that says Download File. We’ll click that and that will take us to a view of the PDF where we can download the sheet to a safe location. Once that’s all done, we now have one of the context sheets for further investigation.

This is what the download button will look like on the new page

For reference, the images for this tutorial were taken from 13754, the first context sheet in the list. However, we can pick any context sheet in the list we’d like, and we will necessarily look at more than one. So we can have fun picking different PDFs to explore!

Kinds of data

Awesome! We got a context sheet. This is an example of one way that the project decided to structure their note-taking: by creating a form. It looks like they also took photos. They probably thought about this before the project started, and hopefully they also thought about how those forms would translate into another common structured and organized format, the table.

If we’ve gone through some of the other data stories written by the Data Literacy Program of AAI, we may be familiar with tables or tabular data. However, instead of utilizing pre-made tabular data for this tutorial, we’ll be making some ourselves based on those context sheets from Gabii that we found.

It’s common for legacy projects to not have things converted to tables yet, so even though this project probably has these data entered in a nice table or database, the practice of turning a form into a structured table is really important.

The structure of the context sheet

Take some time to explore the PDF of the context sheet that we downloaded. How many different sections are there in this particular form? Are they related to each other in any particular way? Are some portions not filled out? Are some portions crossed out? How many of these sections provide specific options? How many look like they are numeric, in that people only put numbers in the box? How many look like someone just wrote whatever? Are there any images?

When we look at a form, it’s important to understand the limits of what it contains. No matter what questions we have, we can’t ask them if the excavators didn’t note that information. So it’s really important to get to know what is there before we move forward.

While some projects will already have tables ready to go, with all the correct sections well labeled and typed out as headings, or even a more detailed digital form that will take all the same information, many projects don’t start out that way. This is why it is important to look at a couple of different sheets before committing to a particular kind of structure. That way we can make sure that all the important note-taking elements are captured.

However, there is a lot going on in this form. Also, this is a tutorial; we’re not actually building this table for anything but practice right now. If we’re members of the Gabii Project, then yes, we should probably do all the headings. We probably aren’t, though, so let’s take a look at the headings and use a few as examples for how to turn these notes into structured data!

Selecting our headings

We can keep a list of these headings anywhere, but for organization’s sake it might be a good time to open up our trusted (or not trusted, as all corporations are evil) spreadsheet program to type these up. Or, if we want, we can just get some lined or graph paper and make this into a physical sheet. Some of our other evaluations might be more difficult to do, but transposing things into a physical table can be a really great exercise to practice the mechanics of these transpositions.

Looking at the form, it looks like the best headings are the WORDS IN ALL CAPS AND BOLD that seem to be associated with particular boxes on the form. These seem to be the major headings. However, some boxes also have Words In Camel Caps That Are Italicized And Bold as well as the other ones. These are probably subheadings, and we’ll want those to be separated into their own columns as well. We may also notice that some boxes have some other things going on.

Now that we’ve looked at the various possible boxes we can use as headings, we’re going to focus on the ones at the top of the context sheet, SOIL/MATRIX, DESCRIPTION with its two possible subsections, and lastly the box labeled INTERPRETATION. Forms are really important for doing archaeology. However, we want to practice organizing the data, not necessarily repeat data entry that’s already been done.

Once we’ve got our list of headings for those sections, with more headings probably than the number of boxes we had, we can put them into an actual table format to move towards our goals.

Making our table

Once our spreadsheet program is open, or we’ve located some awesome paper products, we’ll type the headings that we decided on into the first row in the spreadsheet. If we’re comfortable, we could also start entering this into a database program. If we aren’t familiar with databases, no problem! We’ll explore those in the future, so for now we can just stick with the spreadsheets or physical tables we are comfortable with.

Adding our headings

So we’ll put the first box from the form, SITE, into the first column of the first row, followed by YEAR in the next column of the same row, then AREA, and then SECTOR.

However, when we get to ELEVATION we’ll need to take a short interlude. Within ELEVATION we have two pieces of info, Min: and Max:. As these both have to do with elevation it’ll be better to have these as two columns. We’ll call these ELEVATION_MIN and ELEVATION_MAX.

Then we get to STRATIGRAPHICAL UNIT. This looks like it has two pieces of information as well, but they are structured differently. Rather than both being numeric, it looks like there’s a section that’s probably usually a number and then a section that has specific options. We’ll also break this into two columns: STRATIGRAPHICAL UNIT NUMERIC (for now; we may want to alter that title after looking at other context sheets) and STRATIGRAPHICAL UNIT TYPE, which will capture the check boxes within that section.

As we’ve already seen, DESCRIPTION and INTERPRETATION will probably need to be broken into separate boxes as well. DESCRIPTION already has at least two sections.
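For those of us who would rather see the header row as text than as spreadsheet cells, here is a minimal Python sketch of the headings decided on so far. The column names follow the spellings used in this section; writing them out as CSV, and the function name `header_row_csv`, are our own assumptions for illustration.

```python
import csv
import io

# Headings decided on so far, in the order the boxes appear on the form.
# Spellings follow this tutorial; rename them to suit your own table.
HEADINGS = [
    "SITE",
    "YEAR",
    "AREA",
    "SECTOR",
    "ELEVATION_MIN",
    "ELEVATION_MAX",
    "STRATIGRAPHICAL UNIT NUMERIC",
    "STRATIGRAPHICAL UNIT TYPE",
]


def header_row_csv(headings):
    """Return CSV text that contains only the header row."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(headings)
    return buffer.getvalue()


print(header_row_csv(HEADINGS))
```

Saving that one line of text to a file ending in .csv should give us an empty table that most spreadsheet programs can open.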

Deciding each heading’s data type

Congratulations, we’ve made some very important decisions regarding how we’re going to structure the data collection for this site. Separating those boxes on the form into different column headings can be tough because, on the one hand, fewer columns might make it faster to enter things. On the other hand, more detail can be really important for seeing patterns.

Once we’ve got our headings figured out, we’ll need to decide what to call them and how we might want to restrict entry to them. First, write or type out the headings in the place where we’ll be collecting tabular data. Since we’re not using a database yet, we’ll need to be careful (and explicit) about the kinds of information we’ll want in the table.

We can do this for consistency’s sake by adding a README table that explains a bit about what should be entered in the table and what it holds. These can be very useful if multiple people are doing data entry or if there are likely to be large gaps between data entry events.

However, this is a tutorial, so we shouldn’t feel pressured to do that, though it is always good practice.
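If we do want to try a README table, it could be as small as the sketch below. The three fields (heading, data type, notes) and the data types assigned to each heading are our assumptions for illustration, not the Gabii Project’s own documentation.

```python
# A hypothetical README (data dictionary) for a few of our headings.
# The data types and notes are illustrative assumptions.
README = [
    {"heading": "SITE", "data_type": "text",
     "notes": "Site name exactly as written on the form"},
    {"heading": "YEAR", "data_type": "integer",
     "notes": "Year written in the YEAR box"},
    {"heading": "ELEVATION_MIN", "data_type": "decimal",
     "notes": "Value from the Min: line of ELEVATION"},
    {"heading": "ELEVATION_MAX", "data_type": "decimal",
     "notes": "Value from the Max: line of ELEVATION"},
]


def format_readme(rows):
    """Lay the README out as plain text, one heading per line."""
    lines = ["heading | data_type | notes"]
    for row in rows:
        lines.append(f"{row['heading']} | {row['data_type']} | {row['notes']}")
    return "\n".join(lines)


print(format_readme(README))
```

The same three columns work just as well typed into a second sheet of our spreadsheet.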

Ok, back to the note-able parts. Once we’ve got our headings, to remain consistent with how we want to enter data, we should think about the kinds of information that go into the boxes. Remember how we were wondering whether the notes taken were numeric or not? Well, that is important because we want to be consistent with how those observations are recorded.

The first step will be: is the observation a number or not a number? Is it 2, B (in separate boxes) or not 2B (in one box)?

Once we’ve decided that, there are different directions we’ll need to go. Let’s take the Elevation area of the form: is the information contained in it a number? If yes, the next question is whether it is a whole number or one with a decimal point. Why does this matter? Well, reasons, good reasons, reasons that are slightly more detailed than I want to put down right now.

Let’s just say that big numbers require a lot more memory to store, and they’re even more complicated if they are very large and have decimal points. Furthermore, if we want to do some quick exploratory data analysis in our spreadsheet, we want to make sure all the data types in a column are the same. Otherwise, we might try to find the average of 1 and fish, and that just won’t go anywhere.
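To see why averaging 1 and fish goes nowhere, here is a small Python sketch. The function `safe_average` is a hypothetical helper of ours; it refuses to average a column unless every value is a number.

```python
def safe_average(values):
    """Average the values, or return None if any value is not a number."""
    if not values:
        return None
    for value in values:
        # bool is excluded on purpose: True/False are not measurements.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return None
    return sum(values) / len(values)


print(safe_average([1, 3]))       # every entry is numeric, so this works
print(safe_average([1, "fish"]))  # mixed types, so no average is returned
```

A spreadsheet faces the same problem with a mixed column: it has to either skip the text cell or report an error.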

Here’s what the SOIL/MATRIX box looks like

If we take a look at the SOIL/MATRIX box, which may or may not be filled in on our context sheet, that one section might have more than one heading associated with it. That’s ok! What we’ll want to do with that is as we’re listing out all the headings, we’ll just have one that is soil/matrix plus something that denotes specifically what information we’re recording. We’ll want to make sure we keep these subheadings using a standard name set as well so we can quickly see which headings are part of a larger whole.

Good practice in archaeology is to use more column names or headings rather than fewer. Why? Well, we can always recombine data if we need to. Going in the opposite direction is possible BUT typically causes more errors, as data entry done in that fashion tends to involve using a lot of string data types, which are more complicated to work with. Also, the more we ask people to type things, the more likely we are to cause errors due to spelling or differences in the number of spaces or capitalization, all of which are a problem if we want to be consistent.
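Here is a quick sketch of how spacing and capitalization can turn one value into several. The sample entries are made up for illustration, and `normalize_text` is a hypothetical helper, not part of any spreadsheet program.

```python
def normalize_text(value):
    """Collapse repeated spaces, trim the ends, and lowercase the text."""
    return " ".join(value.split()).lower()


# Made-up entries that a person would read as the same thing.
typed = ["Silty clay", "silty  clay ", "SILTY CLAY"]

print(len(set(typed)))                             # distinct values as typed
print(len({normalize_text(v) for v in typed}))     # distinct values after cleanup
```

Note that this only fixes spacing and capitalization. Spelling mistakes still need a human eye, which is one more reason to keep free-text columns to a minimum.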

These are the headings at the top of the context sheet that we’ll be using

The basic rule, though, is that if it’s a number, we use numbers to record it. For example, take the box called STRATIGRAPHICAL UNIT. We might not know exactly what that is, but by taking a look at our other context sheets it’s pretty obvious that primarily only numbers go in there. If they’re whole numbers, those are what we in the under-caffeinated programming field call “integers”. If they aren’t whole numbers, that is, if there’s a decimal point or comma, they are not integers. Instead, they are a more complicated kind of number that we will deal with in the future.
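The number-or-not question can be sketched as a tiny Python function. The name `classify_entry` and the three labels it returns are our own. It also assumes that a comma in a number is a decimal comma, which would be wrong for values that use commas as thousands separators.

```python
def classify_entry(raw):
    """Label a typed entry as 'integer', 'decimal', or 'text'."""
    text = raw.strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        # Assumption: a comma is a decimal comma, as in 61,25.
        float(text.replace(",", "."))
        return "decimal"
    except ValueError:
        return "text"


for entry in ["2", "61.25", "61,25", "2B"]:
    print(entry, "->", classify_entry(entry))
```

Running each entry of a column through a check like this is a quick way to spot the one cell where somebody typed 2B into a numbers-only box.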

This is what the Stratigraphic unit box looks like

However, we also noticed in this STRATIGRAPHICAL UNIT box that there are a couple of words with little boxes next to them. The words are very important as well so we’ll want to capture them.

However, there are only two options for that. So we’ll want to make sure we standardize how those two options can be listed, but that will depend on whether those options are one or the other, or whether both can be true at the same time.
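The two ways of storing those check boxes could be sketched like this. The option labels OPTION_A and OPTION_B are placeholders, since we are not naming the words on the form here, and both function names are hypothetical.

```python
# Placeholder labels for the two check boxes on the form.
OPTIONS = ("OPTION_A", "OPTION_B")


def as_single_column(ticked):
    """One-or-the-other: a single column that holds the ticked option."""
    if len(ticked) > 1:
        raise ValueError("more than one box ticked; use separate columns")
    return {"STRATIGRAPHICAL UNIT TYPE": next(iter(ticked), "")}


def as_boolean_columns(ticked):
    """Both can be true: one TRUE/FALSE column per option."""
    return {
        f"STRATIGRAPHICAL UNIT TYPE {option}": option in ticked
        for option in OPTIONS
    }


print(as_single_column({"OPTION_A"}))
print(as_boolean_columns({"OPTION_A", "OPTION_B"}))
```

If we find even one context sheet with both boxes ticked, the single-column version stops working, which is a good reason to look at several sheets before deciding.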

Data types

Before we go too far, let’s discuss something called data types. What do we want to know and how do we store it?

If we open up our PDF we’ll see that it looks like a pretty standard form. It has pencil and pen marks denoting particular observations about the context of belongings at the site. What’s cool about forms is that rather than just get whatever the excavator happens to notice written down, forms guide our observations and formalize the series of observations we make.

This is what the Photos section of the 104_13754 Context Sheet looks like

Some of these, like Photos, are very specific. That section asks whether or not photos were taken, and it also includes a space to put down the photo numbers. Others, like Description, on page one, and Interpretation, on page two, are more variable and leave space for excavators to take unstructured notes.

These possibilities help us determine what our data types in the table will be. We can think of each space on the form as a heading, field, or attribute for the tabular data we are going to build. These are the specific observations we want to note.

Once we’ve decided those, the data within that box is the specific entry, row, or observation for the context sheet we’re working with. These refer to the specific thing we are observing and the particular instance of the attributes it had.

This is where we can access the options to format cells in a spreadsheet program.

While these help us understand the observations we want to note, there are many ways to actually input that data. So the second part of data types is what we actually put into those cells and how it is stored. This part of data types falls into a number of categories. If we’ve spent any time in the format cells section of our spreadsheet program, we may have noticed that under categories there are a number of ways to display things. These can include different kinds of numeric values, such as percentages versus numbers without decimal values; dates, which are the worst; boolean, which refers to whether a thing is true or false; and the often-used, terrible-to-work-with text.

These are all the possible ways we can store or format the information within a cell

The main categories that archaeological notes will fall into are numeric and text. There are more specific things, like those visible in the photo and listed previously, but we’ll explore those in more detail later on.

Filling in the table

Now that we’ve decided the structure of the table we can go ahead and put the information that we wanted to gather into it. And as a test to see if that structure makes sense, go back to the beginning of the tutorial and download a few more context sheets. That way we can see if how we organized the data makes sense or if we’ll need to edit it in some fashion.

Or instead of repeating those steps we can just click this link and pick some other context sheets to enter in.

We should do this a number of times so that we really test out the data structure that we created, and so that we have enough observations to play around with and possibly answer some archaeological questions. For the purposes of this tutorial, let’s try to input at least ten context sheets.

How to fill in the table

So we talked about deciding data types previously and now that we’re actually entering data (yay for data entry! Get some headphones and listen to things during this portion!) we want to try to keep what we actually enter in as standard as possible.

On that note: copy and paste is our friend. Copy and Paste might, in fact, be our best friend. The friend who keeps the notes we wrote them in high school and cherishes them forever. Or burns them because they know if those ever got out no one would respect our work as archaeologists. The author of this tutorial is definitely not writing from personal experience.

That aside, for any fields that contain text, referred to in the biz as strings, using copy and paste to ensure consistency is important. However, depending on where we’re doing our data entry, for example in a database, we can set things up in different ways to try to prevent certain kinds of data entry errors. That can also be tricky, though, if we don’t know how diverse our data structures are going to be from the beginning. Copy and Paste will work for now though.

Similarly, we want to make sure that our numbers are entered the same way, and that we don’t do such things as enter numbers as text. Otherwise, if we were to try to do any statistics, we’d be trying to average 1 and ‘one,’ which isn’t possible. This is something we can set in the formatting section of a column in many spreadsheet programs.

This matters particularly for the section labeled Elevation. This is definitely a number, and it looks like it was consistently measured to multiple digits past the decimal point. Sometimes it’s three places, sometimes it’s four. However, we want to make sure we capture the most detail, so we might want to set the formatting to always show us four places past the decimal point.
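Outside of a spreadsheet, the same two ideas (numbers stored as numbers, always shown with four decimal places) might look like the Python sketch below. The helper names are ours and the sample values are made up, not readings from a Gabii context sheet.

```python
def parse_number(raw):
    """Turn a typed cell into a float, or None if it is not numeric."""
    try:
        # Assumption: a comma is a decimal comma, as in 61,125.
        return float(raw.strip().replace(",", "."))
    except ValueError:
        return None


def show_four_places(value):
    """Display a number with exactly four digits after the decimal point."""
    return f"{value:.4f}"


print(parse_number("one"))                       # text is rejected
print(show_four_places(parse_number("61.125")))  # padded to four places
```

Showing four places only changes the display. A value recorded to three places gains a trailing zero, not extra precision.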

A tabular dataset, now what?

Congratulations! We’ve made some data entry decisions. This means that we’ve created a structure for our data. Now that we’ve added in more than one set of rows, entries, or observations, we can see that we’re actually building a data set. We can utilize the spreadsheet skills that we’ve developed in previous tutorials to explore something specific about the data.

After a few observations, take stock and see if we can pick out any particular patterns in the context sheets that we collected. Once we have a few entries input, try filtering the data to see if there’s something that we can say about the site based on the information we’ve put into the table. If we’re unfamiliar with filtering in our spreadsheet program, we can check out the spreadsheet tutorial section of Cow-cultating your data with spreadsheets and R.

Are there any particular words that get repeated in particular headings? Do some elevations have more things in them than others?
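Both questions can also be asked in a few lines of Python once the table exists. The rows below are invented purely to make the sketch runnable, and the helper names and the threshold of 61.0 are hypothetical.

```python
from collections import Counter

# Invented rows for illustration only; real values come from our own table.
rows = [
    {"INTERPRETATION": "fill of a cut", "ELEVATION_MIN": 61.10},
    {"INTERPRETATION": "Fill layer", "ELEVATION_MIN": 61.45},
    {"INTERPRETATION": "wall foundation", "ELEVATION_MIN": 60.90},
]


def word_counts(rows, heading):
    """Count how often each word appears under one text heading."""
    counts = Counter()
    for row in rows:
        counts.update((row.get(heading) or "").lower().split())
    return counts


def rows_at_or_above(rows, heading, threshold):
    """Keep only the rows whose numeric value meets the threshold."""
    return [
        row for row in rows
        if row.get(heading) is not None and row[heading] >= threshold
    ]


print(word_counts(rows, "INTERPRETATION").most_common(3))
print(len(rows_at_or_above(rows, "ELEVATION_MIN", 61.0)))
```

This is the same thing a spreadsheet filter does, just written down so that it can be repeated exactly.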

We can use these patterns to now build a narrative about the collection of observations we have from these contexts. So head on to the narrative portion of the tutorial to see what we decided to pick out of our context sheets. However, narrative building is the next section of our tutorial, so hold off on that for just a second.

We should hold off particularly because we may have noticed that, due to the headings we selected, we’re leaving some cells blank, which can sometimes be a cause for concern when we’re interested in interpreting an archaeological site.
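A quick way to see how many cells we are leaving blank is to count them per heading. This is a minimal sketch with made-up rows; `is_blank` and `blank_counts` are hypothetical helpers, and treating whitespace-only cells as blank is our assumption.

```python
def is_blank(value):
    """A cell is blank if it is missing, empty, or only whitespace."""
    return value is None or str(value).strip() == ""


def blank_counts(rows, headings):
    """Count blank cells under each heading."""
    return {
        heading: sum(1 for row in rows if is_blank(row.get(heading)))
        for heading in headings
    }


# Made-up rows for illustration only.
rows = [
    {"SITE": "Gabii", "INTERPRETATION": ""},
    {"SITE": "Gabii", "INTERPRETATION": "   "},
    {"SITE": "Gabii"},
]

print(blank_counts(rows, ["SITE", "INTERPRETATION"]))
```

A heading that is blank on most sheets may be one the excavators rarely used, or one that we split too finely.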

Congratulations! We’ve made, or are in the process of making, our very own data set. It’s not the largest data set out there, but hopefully we’ve recorded enough observations to feel like we’ve got a handle on how to turn notes into data. And if we haven’t, there are 180+ context sheets we can look at to keep building it, should we want to keep doing data entry.

Which, we know everyone does…not. Once a few are entered and we’ve, as previously mentioned, started to play around with what’s been entered, looking for patterns, we will probably want to do a bit more with these, as this is only the first part of the cycle. We’ve turned our structured notes into data. But we don’t really want data; we want information or, even beyond that, knowledge.

To get to that, we’ll need to move to the next section of this tutorial: turning those notes back into a narrative to say something about the site.