Many Kinds of Minds are Better than One: Value Alignment Through Dialogue

Sanjay Modgil
Department of Informatics, King’s College London

Recent successes in artificial intelligence (AI) have in large part been due to advances in machine learning, and have been accompanied by leading researchers warning of the possible dangers of AI [1,17]. It is argued that the foreseeable benefits of AI will license the development of, and trust in, machines that are increasingly more powerful (with cognitive powers far outstripping those of humans), autonomous, and capable of acting in diverse and open environments. However, such machines may formally achieve their operators’ goals in ways that not only diverge from their operators’ intentions, but may actually be contrary to the interests and values of their operators [1,17,21]. This concern recalls arguments to the effect that adhering to any rule-based ethical system may result in unintended, harmful consequences (as exemplified by Asimov’s laws of robotics [14,15]). However, this problem has acquired renewed urgency given that it is a feature of learning systems that they find unforeseen ways of achieving goals, and that achievement of any operator’s goal will incentivise ‘instrumental goals’ (such as self-preservation) that thwart corrective measures to prevent harm [1,19,20,21].

The need to ensure AI acts in accordance with human values has prompted considerable intellectual investment into what in a machine learning context has been termed the ‘value loading (alignment) problem’ [1,21], more broadly understood as the problem of how to design ‘ethical agents’.¹ Whether the envisaged agents’ ethical behaviour is implemented through the use of machine learning techniques via the maximising of utility functions encoding human preferences, and/or through the use of ‘top-down’ symbolic logic-based reasoning adhering to explicitly encoded ethical theories [12,23], two key research problems need to be addressed [1,16,21]. Firstly, there is the problem of how to specify objective utility functions (deontic axiomatisations) that are perfectly aligned with human values and applicable in changing environments and to novel situations (in particular ethical challenges that lack precedent and thus most saliently expose the is/ought gap, such as those arising from the use of radically new technologies). Secondly, there is the above-described problem of unintended behaviours misaligned with human values.

Run-time learning of values has been proposed to address these problems [19], for example through the use of inverse reinforcement learning [13], in which AI systems are incentivised to observe and query humans [16]; the assumption being that actions reveal preferences and hence values, and that humans are sufficiently informed and have the requisite capacity to definitively arbitrate on matters of ethical importance. However, humans clearly do not always behave ethically, and moreover are often uncertain about how to resolve ethical issues; in particular those arising from the use of radically new technologies.
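The assumption that actions reveal preferences, and hence values, can be made concrete with a minimal sketch of preference inference in the spirit of inverse reinforcement learning [13]. The toy feature vectors and the Boltzmann-rational choice model below are illustrative assumptions, not the mechanism proposed in [13,16]:

```python
# Minimal sketch (illustrative only): inferring hidden 'value' weights
# from observed choices, assuming a Boltzmann-rational choice model
# P(a) ∝ exp(w · φ(a)). The feature vectors and weights are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Each candidate action is described by a feature vector, e.g.
# (benefit to others, benefit to self, risk of harm).
features = np.array([
    [1.0, 0.2, 0.0],   # cooperative action
    [0.1, 1.0, 0.3],   # self-interested action
    [0.5, 0.5, 0.9],   # risky action
])
true_w = np.array([2.0, 0.5, -3.0])  # the human's hidden values

def choice_probs(w):
    """Boltzmann-rational model: P(a) ∝ exp(w · φ(a))."""
    logits = features @ w
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()

# Observe 500 choices made by the (simulated) human.
observed = rng.choice(len(features), size=500, p=choice_probs(true_w))

# Recover w by gradient ascent on the log-likelihood; the gradient is
# (average features of chosen actions) - (expected features under w_hat).
counts = np.bincount(observed, minlength=len(features)) / len(observed)
empirical_phi = counts @ features
w_hat = np.zeros(3)
for _ in range(2000):
    expected_phi = choice_probs(w_hat) @ features
    w_hat += 0.1 * (empirical_phi - expected_phi)

print("true choice probabilities:    ", np.round(choice_probs(true_w), 3))
print("inferred choice probabilities:", np.round(choice_probs(w_hat), 3))
```

Note that the weights are recovered only up to directions that leave the choice probabilities unchanged; even in this toy setting, observed behaviour under-determines the underlying ‘values’.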
¹ Note the assumption that, however advanced the AI (including the ‘super-intelligent’ machines whose developmental trajectory Bostrom rigorously charts once human-level artificial general intelligence is achieved), an AI deciding on moral issues in isolation cannot, even in principle, always align with the moral decision-making of humans. This is because, whether one is an Aristotelian virtue ethicist, a Kantian deontologist, or a consequentialist, one must necessarily access (reports of) first-person subjective experience in deciding what oughts pertain when faced with ethically challenging issues. This is most explicitly acknowledged by the utilitarian account of consequentialism, which advocates ethical choices that impartially maximise total happiness, where happiness is broadly construed as the subjective experience of well-being.