    8c981415
    b=17310 · 8c981415
    Yury Umanets authored
    r=johann,shadow
    
    - fix ptlrpcd blocking while waiting on a very long reply unlink. To do so, a new rpc phase,
    RQ_PHASE_UNREGISTERING, is introduced; a request stays in it until lnet calls
    reply_in_callback() to signal that the reply is unlinked. ptlrpcd skips all requests in this
    phase instead of waiting n * 300s on each of them, which lets it process other rpcs in the set;
    
    - make sure that the inflight count is coherent with presence on the sending or delay list.
    That is, if we see inflight != 0, the rpc must be on one of these lists. This is very helpful in
    ptlrpc_invalidate_import() for showing all rpcs still waiting after the import is invalidated;
    
    - in ptlrpc_invalidate_import(), wait for the maximal rq_deadline - now over all inflight rpcs
    instead of obd_timeout which may be much longer. If the calculated timeout is 0, obd_timeout is
    used. This fixes the case where rq_deadline - now > obd_timeout (very easy to see in logs),
    which led to the inflight != 0 assert because inflight rpcs timed out after our wait period had
    finished;
    
    - in ptlrpc_invalidate_import(), wait forever for rpcs in the UNREGISTERING phase. When the wait
    times out, assert inflight == 0 only if no rpcs are in the UNREGISTERING phase. Only rpcs in
    the UNREGISTERING phase are allowed to stay longer than obd_timeout;
    
    - add the ptlrpc_move_rqphase() function. All phase changes go through it. Add debug_req() there
    to track down all phase changes;
    
    - add conf_sanity.sh test_45 to emulate a very long reply unlink and also the situation where
    rq_deadline - now > obd_timeout;
    
    - do not wait forever in ptlrpc_unregister_reply() in the async case (when it is used from
    sets). The sync case is left unchanged;
    
    - make sure that ptlrpc_set_next_timeout() yields a 1s timeout (instead of 0s) for a set with
    rpcs in the "unregistering" stage, to prevent ptlrpcd from sleeping forever and hanging in
    test_45;
    
    - in ptlrpcd(), make sure that we do not sleep on a 0 timeout.
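
The timeout rules described above can be illustrated with a small standalone sketch. This is not the patch itself: the sketch_* types and functions are hypothetical stand-ins for the ptlrpc structures, times are plain seconds, and skipping UNREGISTERING requests when computing the set timeout is an assumption about how the 1s floor is applied. It only models three ideas from the message: a single helper for phase changes, waiting for the largest rq_deadline - now with a fallback to obd_timeout, and never returning a 0 timeout for a set that still has requests in the unregistering phase.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real ptlrpc structures. */
enum sketch_phase {
    SKETCH_PHASE_RPC,           /* request is in flight */
    SKETCH_PHASE_UNREGISTERING  /* waiting for the reply unlink callback */
};

struct sketch_req {
    enum sketch_phase phase;
    long deadline;              /* absolute time in seconds, like rq_deadline */
};

/* Single choke point for phase changes, so each change could be logged. */
static void sketch_move_phase(struct sketch_req *req, enum sketch_phase next)
{
    assert(req != NULL);
    if (req->phase != next)
        req->phase = next;      /* a debug trace would go here */
}

/* Wait for the largest (deadline - now); fall back to obd_timeout if 0. */
static long sketch_invalidate_wait(const struct sketch_req *reqs, size_t n,
                                   long now, long obd_timeout)
{
    long wait = 0;

    for (size_t i = 0; i < n; i++) {
        long left = reqs[i].deadline - now;
        if (left > wait)
            wait = left;
    }
    return wait > 0 ? wait : obd_timeout;
}

/*
 * Next wakeup for a request set. A return value of 0 means "no timeout",
 * i.e. the caller may sleep forever, so a set that still holds
 * unregistering requests must get at least 1s.
 */
static long sketch_set_next_timeout(const struct sketch_req *reqs, size_t n,
                                    long now)
{
    long timeout = 0;
    int unregistering = 0;

    for (size_t i = 0; i < n; i++) {
        long left;

        if (reqs[i].phase == SKETCH_PHASE_UNREGISTERING) {
            unregistering = 1;  /* assumed: skipped, no deadline of its own */
            continue;
        }
        left = reqs[i].deadline - now;
        if (left <= 0)
            left = 1;           /* already expired: wake up soon */
        if (timeout == 0 || left < timeout)
            timeout = left;
    }
    if (timeout == 0 && unregistering)
        timeout = 1;
    return timeout;
}
```

In this model, a caller such as the ptlrpcd loop would treat a result of 0 from sketch_set_next_timeout() as "nothing to wait for" and would otherwise sleep for at most the returned number of seconds before rechecking the set.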