In 4812bc39e (Add the SHA256 of the ssh key, 2016-05-28) the SHA256
fingerprints were added to the wrong section. The stg.pagure.io
fingerprint is in the pagure.io section and vice versa. The MD5
fingerprints are correct.
This can be confirmed by running ssh-keygen on each host's SSH public
key and comparing the resulting fingerprints:
$ for i in {stg.,}pagure.io.pub; do echo $i; cat $i; for hash in sha256 md5; do ssh-keygen -l -E $hash -f $i; done; echo; done
stg.pagure.io.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDJNu490Rp305zGCJLvhVIrKjL7Xngew3NxgRYeopHBDvj+EFQUqULXtgrI5nUBMSB94RrsuHynFAXYy2m0snHjWzWjbIxM4ZVD2sX4GiKX6qu7WyxcGmGcL08MF919r+JSPL9oWWSq/CvvBF0M1eeqkIpjMZHpVKgR3uTMD5yW994NBLAQi9i1UdwGYNQc1KqWvlvW1XhFFtiIGscIFGRKsUOMvnJvWdU6T+djmzMy4hcahxnsPCZxCjbQpuH1JjihNNVWYOq7Ztjs1gxpTTV19ATp4Z2F95uyyQ3Y+Em9KeXcKXYxwVzYVho5SSB1ZYBL+xAH1osK23PvGD39UYp9
2048 SHA256:x4xld/tPdeOhbyJcTOxd+IbSZ4OpnBzh/IskocyrOME stg.pagure.io.pub (RSA)
2048 MD5:69:50:46:24:c7:94:44:f8:8d:83:05:5c:eb:73:fb:c4 stg.pagure.io.pub (RSA)
pagure.io.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC198DWs0SQ3DX0ptu+8Wq6wnZMrXUCufN+wdSCtlyhHUeQ3q5B4Hgto1n2FMj752vToCfNTn9mWO7l2rNTrKeBsELpubl2jECHu4LqxkRVihu5UEzejfjiWNDN2jdXbYFY27GW9zymD7Gq3u+T/Mkp4lIcQKRoJaLobBmcVxrLPEEJMKI4AJY31jgxMTnxi7KcR+U5udQrZ3dzCn2BqUdiN5dMgckr4yNPjhl3emJeVJ/uhAJrEsgjzqxAb60smMO5/1By+yF85Wih4TnFtF4LwYYuxgqiNv72Xy4D/MGxCqkO/nH5eRNfcJ+AJFE7727F7Tnbo4xmAjilvRria/+l
2048 SHA256:Gddkd5H7oQ1RaK8WgXSKl7JZP+FgLyidmxbLercJ/JY pagure.io.pub (RSA)
2048 MD5:90:8e:7f:a3:f7:f1:70:cb:56:77:96:17:44:c4:fc:82 pagure.io.pub (RSA)
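The same fingerprints can be cross-checked without ssh-keygen. The sketch below is only an illustration and rests on two assumptions: the SHA256 fingerprint is the unpadded base64 of the SHA-256 digest of the decoded key blob, and the MD5 fingerprint is the colon-separated hex of its MD5 digest. The function names are hypothetical and not part of pagure.

```python
import base64
import hashlib


def key_blob(pubkey_line):
    """Return the decoded key blob from an OpenSSH public key line."""
    # Line format: "<key type> <base64 blob> [comment]"
    return base64.b64decode(pubkey_line.split()[1])


def sha256_fingerprint(pubkey_line):
    """SHA256 fingerprint in the style printed by ssh-keygen -E sha256."""
    digest = hashlib.sha256(key_blob(pubkey_line)).digest()
    # The digest is base64-encoded and the trailing '=' padding is dropped.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def md5_fingerprint(pubkey_line):
    """MD5 fingerprint in the style printed by ssh-keygen -E md5."""
    digest = hashlib.md5(key_blob(pubkey_line)).hexdigest()
    # Split the 32 hex characters into 16 colon-separated pairs.
    pairs = [digest[i:i + 2] for i in range(0, len(digest), 2)]
    return "MD5:" + ":".join(pairs)
```

Passing the contents of stg.pagure.io.pub or pagure.io.pub to these functions should give the same SHA256 and MD5 strings as the ssh-keygen output above, provided the assumptions hold.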
Root cause analysis (RCA) of the issue, as sent on IRC:
It's a very interesting edge case and related to my previous diagnosis.
In short: both the pagure main app and the pagure docs app were using the same process pool (WSGIDaemonProcess).
As soon as they were both loaded in the same thread, each would load the FFI (C wrapper) code, and only the
one that loaded it last would still have valid type references. The other would start sending wrong references,
which caused it to error out (correctly), because it didn't know the types it got.
So basically, the fix I just applied is to put pagure docs into its own WSGI daemon process, which keeps them nicely separated.
The reason this didn't hit in staging, and why it also worked *sometimes* in production, is that it would only crash if:
1. both the pagure main app and the docs app were loaded in the thread that's used for the current request, and
2. the pagure docs app was loaded last in that thread, overriding the types for the pagure main app.
We have 4 processes with 4 threads each, so each request gets into one of 16 threads. That makes staging
unlikely to hit both conditions, but prod has so many requests that it's likely to hit 1 and 2.
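
The failure mode described above can be illustrated with a toy model. This is not pagure's code and not the real FFI wrapper; `Worker`, `App` and `Oid` are hypothetical names. It assumes only what the RCA states: each load of the wrapper creates fresh type objects, and the last load replaces what the shared context considers valid.

```python
class Worker:
    """Stands in for one shared execution context (hypothetical model)."""

    def __init__(self):
        self.current_types = None  # type table from the most recent FFI load

    def load_ffi(self):
        # Each load builds a brand-new type object and replaces the table.
        oid_type = type("Oid", (), {})
        self.current_types = {"Oid": oid_type}
        return oid_type


class App:
    """An application that captures its type reference at load time."""

    def __init__(self, name, worker):
        self.name = name
        self.worker = worker
        self.oid_type = worker.load_ffi()

    def handle_request(self):
        value = self.oid_type()
        expected = self.worker.current_types["Oid"]
        # The wrapper only accepts instances of the type it currently knows.
        if not isinstance(value, expected):
            raise TypeError(f"{self.name}: unknown type reference")
        return f"{self.name}: ok"


def shared_worker_demo():
    """Both apps in one worker: the app loaded first ends up with stale types."""
    shared = Worker()
    main = App("main", shared)
    docs = App("docs", shared)  # loaded last, overrides main's types
    results = []
    for app in (docs, main):
        try:
            results.append(app.handle_request())
        except TypeError as exc:
            results.append(str(exc))
    return results


def separate_worker_demo():
    """One worker per app, mirroring the fix: no shared type table."""
    main = App("main", Worker())
    docs = App("docs", Worker())
    return [docs.handle_request(), main.handle_request()]
```

In this model `shared_worker_demo()` fails only for the app that was loaded first, while `separate_worker_demo()` succeeds for both, which is the separation the dedicated WSGI daemon process provides.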