)]}'
{"/PATCHSET_LEVEL":[{"author":{"_account_id":1000001,"name":"Lorenz Brun","display_name":"Lorenz","email":"lorenz@monogon.tech","username":"lorenz","avatars":[{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"85d0df754aff925a75b944007195e111ed5ecf0c","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":12,"id":"081d72a2_ab5c19ff","updated":"2021-07-20 11:04:58.000000000","message":"LGTM","commit_id":"81334b553eedb1a334235aca0337bdb075b1967a"},{"author":{"_account_id":1000002,"name":"Serge Bazanski","display_name":"Serge","email":"serge@monogon.tech","username":"serge","avatars":[{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"28bb54342ad1e3406123fcdf26d3b1be86d28063","unresolved":false,"context_lines":[],"source_content_type":"","patch_set":12,"id":"17100dcb_3dad2d26","updated":"2021-07-19 09:23:13.000000000","message":"Ping.","commit_id":"81334b553eedb1a334235aca0337bdb075b1967a"}],"metropolis/node/core/debug_service.go":[{"author":{"_account_id":1000001,"name":"Lorenz Brun","display_name":"Lorenz","email":"lorenz@monogon.tech","username":"lorenz","avatars":[{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"77cebe49aa49418256fe9df30aaa3234faa39116","unresolved":true,"context_lines":[{"line_number":57,"context_line":"\t\t\treturn nil, status.Errorf(codes.Unavailable, \"could not get roleserve status: %v\", err)"},{"line_number":58,"context_line":"\t\t}"},{"line_number":59,"context_line":"\t\tif v.Kubernetes \u003d\u003d nil {"},{"line_number":60,"context_line":"\t\t\tcontinue"},{"line_number":61,"context_line":"\t\t}"},{"line_number":62,"context_line":"\t\treturn v.Kubernetes.GetDebugKubeconfig(ctx, req)"},{"line_number":63,"context_line":"\t}"}],"source_content_type":"text/x-go","patch_set":11,"id":"9f31b01f_6cd1c67e","line":60,"updated":"2021-07-07 16:30:43.000000000","message":"Should GetDebugKubeconfig stall indefinitely if a node doesn\u0027t have the Kubernetes role configured? IMO it should return FAILED_PRECONDITION or something similar. If the roleserver doesn\u0027t have role information it\u0027s fine to hang, but as soon as it does this call should either succeed or return an error.","commit_id":"a5d829b3f528599d4d28946cfc166525b60293f2"},{"author":{"_account_id":1000002,"name":"Serge Bazanski","display_name":"Serge","email":"serge@monogon.tech","username":"serge","avatars":[{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"d964e699faa3da6108660841bd24b71af47febe3","unresolved":true,"context_lines":[{"line_number":57,"context_line":"\t\t\treturn nil, status.Errorf(codes.Unavailable, \"could not get roleserve status: %v\", err)"},{"line_number":58,"context_line":"\t\t}"},{"line_number":59,"context_line":"\t\tif v.Kubernetes \u003d\u003d nil {"},{"line_number":60,"context_line":"\t\t\tcontinue"},{"line_number":61,"context_line":"\t\t}"},{"line_number":62,"context_line":"\t\treturn v.Kubernetes.GetDebugKubeconfig(ctx, req)"},{"line_number":63,"context_line":"\t}"}],"source_content_type":"text/x-go","patch_set":11,"id":"9e133b25_219ea6e3","line":60,"in_reply_to":"22d75339_cdf2eb98","updated":"2021-07-20 09:42:54.000000000","message":"1. Why is the roleserver tracking actual state instead of intent, at least for roles?\n\nThe roleserver itself is tracking intent (by lookint at roles in the curator), and implementing state (by starting/stopping services). However, it is exporting, in its Watch API, only the newest effect of that reconciliation, ie. the effective state. This is because currently all consumers of this Watch API really only care about the state, ie. the debug service wants to get a handle to the Kubernetes service if present, and doesn\u0027t care about the intent. So it\u0027s purely driven by what the current code using it expected, and some vaguely educated guesses about future needs.\n\n2. How easy would it be to change that later?\n\nWe can add intent as patch of the watch structure if it\u0027s really needed, as I don\u0027t think I\u0027ll be strongly building upon the roleserver serving only intent any time soon. And if I do, we can return to this discussion and make a decision, hopefully with some more knowledge by then already.\n\nIf it happens that someone ever hits the \u0027I asked for a debug kubeconfig and the damn things just hang on me without any clue as to why\u0027, we can: \na) add the above, thereby providing intent as well as state in the roleserver watch struct,\nb) modify GetDebugKubeconfig so that it returns FAILED_PRECONDITION until the intent is to run Kubernetes, then return UNAVAILABLE until the Kubernetes service is actually running, and then actually return a debug kubeconfig.","commit_id":"a5d829b3f528599d4d28946cfc166525b60293f2"},{"author":{"_account_id":1000001,"name":"Lorenz Brun","display_name":"Lorenz","email":"lorenz@monogon.tech","username":"lorenz","avatars":[{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"205b59604582cb931ae9c89d9eba09a68337eb9c","unresolved":true,"context_lines":[{"line_number":57,"context_line":"\t\t\treturn nil, status.Errorf(codes.Unavailable, \"could not get roleserve status: %v\", err)"},{"line_number":58,"context_line":"\t\t}"},{"line_number":59,"context_line":"\t\tif v.Kubernetes \u003d\u003d nil {"},{"line_number":60,"context_line":"\t\t\tcontinue"},{"line_number":61,"context_line":"\t\t}"},{"line_number":62,"context_line":"\t\treturn v.Kubernetes.GetDebugKubeconfig(ctx, req)"},{"line_number":63,"context_line":"\t}"}],"source_content_type":"text/x-go","patch_set":11,"id":"22d75339_cdf2eb98","line":60,"in_reply_to":"314da7dd_3b3f71c1","updated":"2021-07-19 16:10:25.000000000","message":"IMO returning FAILED_PRECONDITION and having the caller retry is not a bad idea. But I can see how it might be more convenient and performant to have the RPC wait for the data to be returned. The issue I have with doing this is it returns no information in case an unexpected condition arises (for example if I\u0027m debugging Metropolis I can\u0027t distinguish between a network issue, a quorum issue or the fact that no K8s role is configured). Having a client retry on FAILED_PRECONDITION is not an uncommon operation and allows us to log what precondition failed and allows each client to set reasonable retry policies for itself.\n\nI\u0027m still not sure if it\u0027s a good idea that the roleserver doesn\u0027t hold intent, but I don\u0027t think I should hold up landing this stack for this. So two questions:\n1. Why is the roleserver tracking actual state instead of intent, at least for roles?\n2. How easy would it be to change that later?","commit_id":"a5d829b3f528599d4d28946cfc166525b60293f2"},{"author":{"_account_id":1000001,"name":"Lorenz Brun","display_name":"Lorenz","email":"lorenz@monogon.tech","username":"lorenz","avatars":[{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/75c04f6e9881c24ee621fba80667eed8.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"85d0df754aff925a75b944007195e111ed5ecf0c","unresolved":false,"context_lines":[{"line_number":57,"context_line":"\t\t\treturn nil, status.Errorf(codes.Unavailable, \"could not get roleserve status: %v\", err)"},{"line_number":58,"context_line":"\t\t}"},{"line_number":59,"context_line":"\t\tif v.Kubernetes \u003d\u003d nil {"},{"line_number":60,"context_line":"\t\t\tcontinue"},{"line_number":61,"context_line":"\t\t}"},{"line_number":62,"context_line":"\t\treturn v.Kubernetes.GetDebugKubeconfig(ctx, req)"},{"line_number":63,"context_line":"\t}"}],"source_content_type":"text/x-go","patch_set":11,"id":"ddff14cb_1eb3e9d6","line":60,"in_reply_to":"9e133b25_219ea6e3","updated":"2021-07-20 11:04:58.000000000","message":"Thanks for the explaination, that makes sense. I think this is good to go like this, it is easy enough to change this if we see a need.","commit_id":"a5d829b3f528599d4d28946cfc166525b60293f2"},{"author":{"_account_id":1000002,"name":"Serge Bazanski","display_name":"Serge","email":"serge@monogon.tech","username":"serge","avatars":[{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d32","height":32},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d56","height":56},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d100","height":100},{"url":"https://www.gravatar.com/avatar/52c41428b6369f2c02b9717425216f7d.jpg?d\u003didenticon\u0026r\u003dpg\u0026s\u003d120","height":120}]},"change_message_id":"325419772ae950a8518de204eb76755629f121ec","unresolved":true,"context_lines":[{"line_number":57,"context_line":"\t\t\treturn nil, status.Errorf(codes.Unavailable, \"could not get roleserve status: %v\", err)"},{"line_number":58,"context_line":"\t\t}"},{"line_number":59,"context_line":"\t\tif v.Kubernetes \u003d\u003d nil {"},{"line_number":60,"context_line":"\t\t\tcontinue"},{"line_number":61,"context_line":"\t\t}"},{"line_number":62,"context_line":"\t\treturn v.Kubernetes.GetDebugKubeconfig(ctx, req)"},{"line_number":63,"context_line":"\t}"}],"source_content_type":"text/x-go","patch_set":11,"id":"314da7dd_3b3f71c1","line":60,"in_reply_to":"9f31b01f_6cd1c67e","updated":"2021-07-08 11:14:24.000000000","message":"With the current roleserver watch interface status.Kubernetes \u003d\u003d nil doesn\u0027t necessarily mean that the node does not have a Kubernetes role. It might also mean that the the roleserver just configured some other role for the node and fired an event for that (and in a subsequent update .Kubernetes will also be set, but not yet). This implementation could be changed, but that would make it more complex, as the roleserver status would have to also carry intent, not just status.\n\nAdditionally, this is supposed to handle cases where a node has just recently been configured to have a Kubernetes role attached to it, but that role intent has not yet been propagated to the node (etcd slowness, network issues, ...), or the Kubernetes role launcher has not finished setting up yet (slow service startup). In those cases, callers would have to implement retry logic to handle \u0027spurious\u0027 FAILED_PRECONDITION errors and wait for propagation. Even if we change the roleserver implementation as above (to also be aware of intent), there could still be race windows in which a node is reconfigured to run the Kubernetes role but the roleserver has not yet updated its view of the intent.\n\nI think that if callers ask for a DebugKubeconfig, they do that because they know this node is supposed to have one, and blocking until one is truly available is the expected behaviour. There might be separate calls for retrieving role intent (at a curator level) and debugging roleserver state (at a node level), but this RPC should assume that the user\u0027s intent is to get a kubeconfig, not figure out if the node is configured/behaving in a way that\u0027s expected.\n\nIf this reasoning makes sense to you, I can codify this in the proto. Otherwise let\u0027s keep discussing how to approach this.","commit_id":"a5d829b3f528599d4d28946cfc166525b60293f2"}]}
