What Is Challenging About Data Processing? Analysis of Data Processing Difficulties Using Stack Overflow


Correct data processing is essential for data-driven software products (e.g., AI-enabled systems) to make accurate predictions. However, recent studies [1,2,4] found that developers face several difficulties when implementing data processing logic. Improving this situation requires a deeper understanding of these difficulties. This thesis therefore investigates which data operations (e.g., standardization, scaling) and tasks (e.g., data cleaning, transformation, integration) are the most challenging for developers. To this end, Q&A posts on Stack Overflow related to common data processing libraries (e.g., pandas, scikit-learn) are to be analyzed.



Analysis of data processing-related posts on Stack Overflow

Identification and classification of data processing-related difficulties

Presentation of the obtained results and findings



  • Review the literature on analyzing Stack Overflow posts
  • Formulate the research questions to be answered
  • Develop a concept for conducting the analysis
    • scope, e.g., the Python data processing library pandas and the scikit-learn preprocessing package
    • tags used to select posts
    • extraction and preprocessing of posts
    • selection of questions to include (e.g., a sample for qualitative analysis)
    • labeling approach (e.g., manual)
  • Conduct the empirical analysis
  • Analyze and evaluate the results
  • Present the results
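
The tag definition, extraction, preprocessing, and sampling steps listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the method prescribed for the thesis: the tag set, the record fields (`id`, `post_type`, `tags`, `body`), the `<tag1><tag2>` tag encoding, the choice to drop code blocks, and all function names are hypothetical, and the sample records are invented placeholders rather than real posts.

```python
import random
import re
from html import unescape

# Assumption: example tag set; the actual tags are defined as part of the thesis.
TARGET_TAGS = {"pandas", "scikit-learn"}

# Invented placeholder records (not real Stack Overflow posts).
# Assumption: post_type 1 marks a question, 2 marks an answer.
SAMPLE_POSTS = [
    {"id": 1, "post_type": 1, "tags": "<python><pandas>",
     "body": "<p>How do I <b>merge</b> two frames?</p><pre><code>df.merge(x)</code></pre>"},
    {"id": 2, "post_type": 2, "tags": "", "body": "<p>Use merge.</p>"},
    {"id": 3, "post_type": 1, "tags": "<java>", "body": "<p>Unrelated</p>"},
    {"id": 4, "post_type": 1, "tags": "<scikit-learn>",
     "body": "<p>Scaling &amp; standardization?</p>"},
]


def parse_tags(raw):
    """Split a tag string such as '<python><pandas>' into a set of tag names."""
    return set(re.findall(r"<([^<>]+)>", raw))


def strip_html(body):
    """Drop <pre> code blocks and remaining HTML tags, then normalize whitespace."""
    body = re.sub(r"<pre>.*?</pre>", " ", body, flags=re.DOTALL)
    body = re.sub(r"<[^>]+>", " ", body)
    return " ".join(unescape(body).split())


def select_questions(posts, tags=TARGET_TAGS):
    """Keep questions that carry at least one of the target tags."""
    return [p for p in posts
            if p["post_type"] == 1 and parse_tags(p["tags"]) & tags]


def draw_sample(questions, k, seed=42):
    """Draw a reproducible random sample of at most k questions for manual labeling."""
    rng = random.Random(seed)
    return rng.sample(questions, min(k, len(questions)))


questions = select_questions(SAMPLE_POSTS)
cleaned = {q["id"]: strip_html(q["body"]) for q in questions}
to_label = draw_sample(questions, k=1)
```

Fixing the random seed keeps the sample reproducible, so the questions handed to the manual labeling step can be regenerated later. Whether code blocks should be dropped or kept during preprocessing depends on the research questions and is left open here.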



[1] Wang, Z. (2021). Understanding the Challenges and Assisting Developers with Developing Spark Applications.

[2] Islam et al. (2019). What Do Developers Ask About ML Libraries? A Large-scale Study Using Stack Overflow.

[3] Bagherzadeh and Khatchadourian (2019). Going Big: A Large-scale Study on What Big Data Developers Ask.

[4] Alshangiti et al. (2019). Why is Developing Machine Learning Applications Challenging? A Study on Stack Overflow Posts.

[5] Bajaj et al. (2014). Mining Questions Asked by Web Developers.