As machine learning systems enter the open world, their accountability becomes a high-priority problem. Accountability requires a deep understanding of system behavior and its failures, and characterizing failures and shortcomings is particularly complex for systems composed of multiple machine-learned components. In this talk, I will discuss our work on troubleshooting and in-depth failure analysis for such systems. First, I will present a methodology that applies counterfactual analysis with humans in the loop to understand which component fixes are most effective for a given system architecture. Second, I will describe the functionalities and tools that we have built for detailed system performance analysis and prediction. Both lines of work will be illustrated with a real-world case study: a machine learning pipeline for image captioning.
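The human-in-the-loop counterfactual analysis mentioned above can be sketched roughly as follows: each component in the pipeline is replaced, one at a time, by an oracle (e.g., human-corrected output standing in for a perfect fix), and the resulting end-to-end improvement estimates how valuable fixing that component would be. Everything below (the function names, the toy components, and the 0/1 scoring) is an illustrative assumption, not the actual system discussed in the talk.

```python
# Minimal sketch of human-in-the-loop counterfactual analysis for a
# pipeline of machine-learned components. All names and data here are
# hypothetical illustrations, not the speaker's actual system.

def run_pipeline(example, components):
    """Run components in sequence, feeding each one's output to the next."""
    state = example
    for component in components:
        state = component(state)
    return state

def counterfactual_gains(examples, references, components, oracles, score):
    """For each component i, swap in its oracle (the counterfactual
    'perfect fix') and measure the change in average end-to-end score;
    larger gains point to the most effective component fix."""
    n = len(examples)
    baseline = sum(score(run_pipeline(x, components), r)
                   for x, r in zip(examples, references)) / n
    gains = {}
    for i, oracle in enumerate(oracles):
        patched = components[:i] + [oracle] + components[i + 1:]
        fixed = sum(score(run_pipeline(x, patched), r)
                    for x, r in zip(examples, references)) / n
        gains[i] = fixed - baseline
    return gains
```

For example, in a two-stage pipeline where the first stage is buggy, swapping in an oracle for the first stage yields a large end-to-end gain, while fixing the already-correct second stage yields none; ranking components by these gains tells us where a fix pays off most for the given architecture.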