What's Taught In a Doctoral Data Engineering Course?

Bryan Christ
October 12, 2022

Assistant Professor of Data Science Jon Kropko is an advocate for graduate education and making data science accessible to everyone. He is not new to teaching graduate students and serves as co-program director of the M.S. in Data Science Online program. He is also excited to work with doctoral students who are starting their first year in the new Ph.D. in Data Science. We sat down with him to learn about the Data Engineering course he is teaching.

Q: What are the key differences between a data engineering course at the Ph.D. vs. master's level? 

Both courses are about the data pipeline: where you get your data from, how you get it onto your computer and into a place where you can use Python to manipulate it, and how you then use the data to make some sort of project, like a data visualization or a dashboard.

The difference between the master's level and the Ph.D. level coursework has to do with the end goal. At the master's level, the goal is to learn how to work through the data science pipeline on your own machine. With the Ph.D. class, it's not only about learning how to do it but also learning how to make it reproducible as part of a dissertation or research agenda. 

For example, say you're doing quantitative research, and you are collecting data and want to make a product with it. Maybe that product is a deployed machine learning model that will be used to make predictions in the field or a web-enabled dashboard that's sharing visualizations and statistics from the data. Whatever the product is, it's not enough just to make the thing. You also have to be ready to share all of your code and the software that was needed to deploy what you have built on another computer so that peer reviewers can review your work. The goal is to avoid problems like your code only working on a Mac or only with a specific version of Python. The end goal is to master the data pipeline in a way that prepares you to do research at the highest level.

Q: What skills and concepts will Ph.D. students learn in your course?  

In an introductory statistics course or an introductory Python course, you often get clean data, and then professors show you how to do things like regression. This doesn't happen in the real world. With clean datasets, you don't have to worry about what the columns are named, whether you need to recode the categories, whether you need to query the data from a database, or whether you need to merge or reshape the data set. The truth is, regardless of whether you're working in industry or as an academic researcher, the time-consuming labor comes from finding the data in the first place, getting it into a form where you can do any Python analyses on it at all, and then manipulating it so something like a machine learning model becomes possible. There's a ton of work that is necessary to prepare data to get to that level.
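
As a rough illustration of the kind of preparation described above (a sketch, not taken from the course materials), the following Python renames messy columns, recodes categories, and merges two small data sets. The records, column names, and the `VOTE_LABELS` mapping are all hypothetical and exist only for this example.

```python
# Hypothetical raw records, as they might arrive from two different sources.
raw_votes = [
    {"Member Name ": "A. Smith", "VOTE": "Y"},
    {"Member Name ": "B. Jones", "VOTE": "N"},
    {"Member Name ": "C. Lee", "VOTE": "nv"},
]
parties = [
    {"member_name": "A. Smith", "party": "D"},
    {"member_name": "B. Jones", "party": "R"},
]

# Assumed recoding scheme: terse source codes mapped to readable categories.
VOTE_LABELS = {"y": "yes", "n": "no", "nv": "not voting"}


def clean_column(name):
    """Normalize a column name: trim, lowercase, spaces to underscores."""
    return name.strip().lower().replace(" ", "_")


def clean_rows(rows):
    """Rename every column and recode the 'vote' category in each row."""
    cleaned = []
    for row in rows:
        new_row = {clean_column(key): value for key, value in row.items()}
        code = new_row["vote"].strip().lower()
        new_row["vote"] = VOTE_LABELS.get(code, "unknown")
        cleaned.append(new_row)
    return cleaned


def left_merge(rows, lookup_rows, key):
    """Attach the lookup columns to each row; unmatched rows get None."""
    lookup = {row[key]: row for row in lookup_rows}
    extra_columns = [col for col in lookup_rows[0] if col != key]
    merged = []
    for row in rows:
        match = lookup.get(row[key], {})
        combined = dict(row)
        for col in extra_columns:
            combined[col] = match.get(col)
        merged.append(combined)
    return merged


tidy = left_merge(clean_rows(raw_votes), parties, "member_name")
for record in tidy:
    print(record)
```

Even on three rows, the sketch has to deal with a stray space in a column name, inconsistent capitalization in the codes, and a record that has no match in the second source.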

In the Data Engineering course, my goal is to prepare doctoral students to do that cleaning work, which is a prerequisite to doing machine learning. The other skill I teach is communicating results from their analyses. Once you do machine learning, you've got the results. But what do you do with them? You need to make something any client, no matter what level of technical ability or knowledge they have, can understand and use. I focus on teaching how you do data visualization, how you build a dashboard, and how you can talk about your end product. So I teach everything that comes before and after machine learning, which is really awesome.

Q: Can you talk a little bit about the final project students will submit?

Throughout the Data Engineering course, I am doing a project in front of the students and then asking each of them to do their own project on a different topic. My topic is building a dashboard to foster transparency in Congress. It can be very confusing for a citizen to know what their representatives and senators are doing in Congress and answer questions like, "How are they voting? What bills are they sponsoring? What committees are they on? What sort of an impact is this person having as your representative? Who are they working with?" 

All of the answers to these questions are public information, but it's hard to find for somebody without both the technical skill to access multiple online data sets and the knowledge of what information they should be looking for. So, I want to take all this information and put it in one dashboard where you can look at any representative or senator and see how they voted on any bill, what bills they sponsored, who they sponsored with most often, and how they stack up ideologically.

This is all public data. I'm just collecting it using the skills that are part of the Data Engineering course, like working with flat files, APIs, and web scraping. I then organize it using databases and query it to communicate the results in my dashboard. Each of the students will work on their own project as well, choosing a topic that is relevant to their research or of personal interest. They need to find sources of data on the internet. They need to get that data into a Python environment. They then need to organize and clean the data, and then they need to create something for mass consumption.
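
To make the middle of that pipeline concrete, here is a minimal sketch (not the actual project code) that loads a flat file into a database and queries it for the "who they sponsored with most often" question. The CSV contents, bill identifiers, table name, and function names are all hypothetical, and an in-memory SQLite database stands in for whatever database a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file contents; a real project would read a downloaded CSV.
SPONSORSHIP_CSV = """bill_id,sponsor,cosponsor
HR-1,A. Smith,B. Jones
HR-2,A. Smith,B. Jones
HR-3,A. Smith,C. Lee
HR-4,B. Jones,C. Lee
"""


def load_sponsorships(conn, csv_text):
    """Create the table if needed and insert one row per CSV record."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sponsorships "
        "(bill_id TEXT, sponsor TEXT, cosponsor TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["bill_id"], r["sponsor"], r["cosponsor"]) for r in reader]
    conn.executemany("INSERT INTO sponsorships VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def top_cosponsors(conn, sponsor):
    """Return (cosponsor, bill count) pairs, most frequent partner first."""
    cursor = conn.execute(
        """
        SELECT cosponsor, COUNT(*) AS n_bills
        FROM sponsorships
        WHERE sponsor = ?
        GROUP BY cosponsor
        ORDER BY n_bills DESC, cosponsor ASC
        """,
        (sponsor,),
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_sponsorships(conn, SPONSORSHIP_CSV)
print(top_cosponsors(conn, "A. Smith"))
```

The query uses a `?` placeholder rather than string formatting, so a name typed into a dashboard search box is passed to the database as data and not as SQL.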

Q: What are the key differences in the learning environment and pedagogical approach for Ph.D. students?

You can expect more from Ph.D. students in terms of their ability to get themselves unstuck and really work on a problem until they figure out how to solve it on their own computer. You can trust them to be self-motivated and self-directed.  

This gets at a larger philosophical difference between how to engage with doctoral students in a classroom vs. how to engage with undergraduate and master's students in the classroom. Ph.D. students aren't really students in the same way as undergraduate and master's students. They're more like peers. Professors often think of them as junior members of the same research enterprise to which the professors themselves belong.

Another difference is that the hierarchy between teacher and student is less strict. It's more of a mentoring relationship. For example, for each class I set a goal like, "Today, I want to get my GitHub and Docker set up and have them work together." Then, I just walk them through the steps with everybody doing it on their own computers. If anybody has an issue or problem, they just say so, and we pause class and solve it together.

We can do that, and at such a pace, because faculty trust Ph.D. students to figure things out as problems come up. It also helps that the class sizes are smaller. My Data Engineering course only has five students right now, which is beautiful.

Q: What do you enjoy most about teaching Ph.D. students?  

I appreciate that with Ph.D. students you can strip away all of the titles and typical social norms between student and professor. Instead, you get very focused on the intellectual content. You know you're just a group of smart people all working on a hard problem together and, to me, that is academia working in the greatest possible way. 

Q: How will the Data Engineering 2 course be built?  

The topics we will get into next will include a deep dive into cloud computing for big data. We're going to introduce cloud computing in Data Engineering, but we'll do a deep dive in the next class. When we talk about big data, we mean data and data applications that are so big that we can't involve a local computer at all. The question then becomes, "How do we store and analyze the data in a way that fits our budget?" This will be the focus of the next course.

Q: How do you go about developing a course; what is your process?

When I am developing a new course, I try to think of it as a story or a narrative. Teaching is an exercise in memory and attention, which is why a course should have a narrative arc: a story helps a great deal with memory.

For example, if each topic builds in a very natural way off of the last topic that we covered, it helps people to remember each topic because it's being brought up again and again, week after week. Say I were trying to teach you how to walk. The first week I would teach you how to put on shoes. The next week I would teach you how to take a couple of steps. But first, you have to put on the shoes in order to take a few steps, so you encounter the topic of putting on your shoes twice. This helps you remember.

It's also motivating for students to have a narrative arc because they can see where it's going and how each individual topic plays into the end goal of the class. I think where classes tend to fall apart is when they become a hodgepodge collection of individual topics that don't lead anywhere. Anytime I teach a class, I try to find that unifying story and keep the focus on the end goal.

Q: Is there anything else you want to share? 

I just think it's awesome that we have Ph.D. students. The most amazing thing about doctoral students is that these are individuals who could completely change the field where I, as a faculty member, am supposedly an expert.